Homework 3. Multithreaded gzip compression filter

Background

You're working for a web server house that regularly needs to generate some fairly large data streams and compress them using gzip. The compression part of this is turning into a CPU bottleneck and the users are starting to complain. Your company can easily afford to throw hardware at the problem, so your boss asks you whether you can reprogram your servers to do the compression in parallel, taking advantage of the fact that your servers are multiprocessor.

You look into the matter, and discover that there's a C implementation of a program called pigz that does something along the lines that you want. (For convenience, a copy of this program is available on Seasnet in the /usr/local/cs/src/pigz directory, and an executable is installed in /usr/local/cs/bin/pigz.) The pigz program can be used as a filter that reads programs from standard input and writes to standard output. Unfortunately, it has a problem: it's written in C. Your company has standardized on Java, so that it can use just one version of each executable and run it on a wide variety of servers that you have: some are x86-64, some are x86, some Sparc, and some use secret hardware whose nature isn't disclosed even to your group.

You tell this to your boss, who responds, "OK, so do what pigz is doing, but do it in Java". Her suggestion is to use standard Java classes and see how well your substitute does, compared to standard pigz.

Luckily for you, the gzip format has the property that you can partition an input stream, compress each partition separately, and concatenate the compressed versions of each partition; the resulting compressed stream can be decompressed by pigz, by gzip, or by any other gzip-format-understanding program.

Assignment

Write a Java program called JavaPigz that behaves like the C pigz implementation, in the sense that it operates with multiple compression threads to improve wall-clock performance. Each compression thread acts on an input data block separately from the other input blocks; the compressed output blocks are generated in the same order that their uncompressed blocks were input. The input block size defaults to 128 KiB, but the user can override this. The number of compression threads defaults to the number of available processors; this also can be overridden. Your program may also use a small, fixed number of threads to control the compression threads or to do input/output.

Your implementation can be a simplification of pigz, in the following ways:

JavaPigz should behave like pigz in the following respects:

Measure the performance of the pigz, compared to your JavaPigz, and compare both to /usr/bin/gzip. For your measurement platform, use the Seasnet Linux servers. Use shell commands like the following to compare the performance of the three implementations:

input=/usr/include/c++/3.4.6/x86_64-redhat-linux/bits/stdc++.h.gch/O2g
time gzip <$input >gzip.gz
time pigz <$input >pigz.gz
time java JavaPigz <$input >JavaPigz.gz

# This checks JavaPigz's output.
gzip -d <JavaPigz.gz | cmp - $input

Try at least one other benchmark as well, of your own design. See what happens if the block size or the number of processors is changed to values other than the default. Run each trial at least three times, and report each instance of real time, user time, and system time; also, report the compression ratio of each command.

Assess your work by writing an after-action report that summarizes the performance that you observed. Focus in particular on any problems you foresee as the file size and the number of threads scales up, and which method you expect to work better in general. This report should be a simple ASCII text file that consumes at most three pages.

Your implementation should operate correctly under Java Standard Edition 6 Update 20. There is no need to run on older Java versions. Your program should compile cleanly, without any warnings.

If your PATH is set correctly to a string starting with "/usr/local/cs/bin:", the command "java -version" should output the following text:

java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Server VM (build 16.3-b01, mixed mode)

the command "pigz --version" should output "pigz 2.1.6", and the command "gzip --version" should output "gzip 1.3.5" followed by some licensing information.

Submit

To turn in your assignment, submit a single jar file hw3.jar that contains both your Java source code and a plain text file README.txt that holds your assessment. Do not submit class files. Also, please do not put your name, student ID, or other personally identifying information in your files. Before submitting hw3.jar you should test it using the following shell commands on SEASnet:

# Make a fresh directory and change into it.
mkdir testdir
cd testdir

# Extract your program.
jar xf ../hw3.jar

# Make sure you have a README.txt file.
ls -l README.txt

# Build your modified version.
javac `find . -name '*.java'`

# Check your modified version; the output should be empty.
cat ../hw3.jar | java JavaPigz | gzip -d | cmp - ../hw3.jar

Hints

Here are some library references and tips that may help.


© 2010 Paul Eggert. See copying rules.
$Id: hw3.html,v 1.55 2010/04/20 07:16:05 eggert Exp $