You're working for a web server analysis firm that regularly needs to look for patterns in URLs in real-time. Many of the URLs are quite long, and the pattern analysis can be complicated. Your firm will have an advantage over your competitors if you can match individual URLs faster than your competition. Your company can easily afford to throw hardware at the problem, so your boss asks you whether you can reprogram your servers to do the pattern analysis in parallel, taking advantage of the fact that your servers are multiprocessors.
Some of the patterns are quite simple: so simple that they can be represented as file name patterns like x*.c and x?*[a-z]*.h. These patterns are used by the C-language glob function standardized by POSIX. This notation is a subset of the usual regular expression syntax. A large fraction of these patterns use only the following features:
You have no need to support other POSIX features, including special treatment of . or of / or of bracket expressions that contain collating symbols such as [.ch.], equivalence class expressions such as [=a=], and character class expressions such as [:alpha:].
A colleague of yours has written a prototype matcher called Glob, and has given you its source code. You look into it, and discover that, while it works well in some cases, it has two problems. First, it doesn't support negated bracket expressions such as [!a-z]. Second, it's slow when given certain complicated patterns.
Modify Glob to address these two problems. First, add support for negated bracket expressions: this is pretty much just a warmup to get you used to the code. Second, add support for globbing in parallel, so that multiple threads can explore different ways of matching a pattern against a single string.
To implement the latter, define a new constructor Glob(pattern, n) that acts just like the existing Glob(pattern) constructor, except that it uses at most n threads to solve the problem. Currently, when matching *, the code uses a loop to try alternatives one by one; your modification should use parallelism to try alternatives in parallel, subject to the constraint that it should never use more than n threads total. If n is less than 2, Glob(pattern, n) should act just like Glob(pattern). Like the existing implementation, your implementation should be greedy: when matching *, if there are (say) five alternatives and you have only three threads, they should first be assigned to the three longest matches for *. If you have more threads than alternatives, assign one thread to each alternative, and assign all the remaining threads to the longest alternative.
When a thread exhausts an alternative, your code should try another unexplored alternative, either by having the thread exit and then creating a new thread, or by having the thread turn its attention directly to the unexplored alternative. If more than one unexplored alternative is available, the code should try the one that has the longest match among the most-recently generated unexplored alternatives.
When a thread finds a match, the other threads should stop looking so as to not waste CPU cycles.
In your test cases, compare the performance using 1, 2, and 8 threads, on the SEASnet Linux servers; also, compare the performance to the single-threaded version that merely adds support for negated ranges. Use shell commands like the following to measure the performance:
time java Glob
Add at least one other benchmark test case, of your own design.
Assess your work by writing an after-action report that summarizes the performance that you observed. Focus in particular on any problems you foresee as pattern and string sizes and the number of threads scale up, and which method you expect to work better in general. This report should be at most three pages. ASCII text is fine, or you can submit a PDF file if you prefer.
Your implementation should operate correctly under Java Standard Edition 6 Update 22. There is no need to run on older Java versions. Your program should compile cleanly, without any warnings.
If your PATH is set correctly to a string starting with "/usr/local/cs/bin:", the command "java -version" should output the following text:
java version "1.6.0_22" Java(TM) SE Runtime Environment (build 1.6.0_22-b04) Java HotSpot(TM) Server VM (build 17.1-b03, mixed mode)
To turn in your assignment, submit a single jar file hw3.jar that contains both your Java source code and a plain text file README.txt or a PDF file README.pdf that holds your assessment. Do not submit class files. Also, please do not put your name, student ID, or other personally identifying information in your files. Before submitting hw3.jar you should test it using the following shell commands on SEASnet:
# Make a fresh directory and change into it. mkdir testdir cd testdir # Extract your program. jar xf ../hw3.jar # Make sure you have a README.txt or README.pdf file. if test -f README.txt; then ls -l README.txt else ls -l README.pdf fi # Compile your modified code. javac `find . -name '*.java'` # Run your tests java Glob