In the previous project, the mutex and spin-lock proved to be bottlenecks that prevented parallel access to the list. In this project you will do additional performance instrumentation to confirm this, and extend your previous solution to deal with the problem. This project can be broken up into a few major steps:
Partitioned lists and finer granularity locking are discussed in sections 29.2-29.4.
You are free to implement methods beyond those described in the provided SortedList.h, but you cannot make any changes to that header file; it describes the interfaces that you are required to implement.
To perform this assignment, you will need to research, choose, install and master a multi-threaded execution profiling tool. Execution profiling is a combination of compile-time and run-time tools that analyze a program's execution to determine how much time is being spent in which routines. Several standard profiling tools are available on Linux; the instructions below assume gperftools and its google-pprof analyzer.
This project is about scalable parallelism, which is only possible on a processor with many cores. You can do most of your development and testing on any Linux system, but if your personal computer has no more than a few cores (e.g., 8-12), you will not be able to do real multi-threaded performance testing on it. Lab servers are available if you need access to a larger machine for your final testing, graph production, and performance analysis.
If you are testing on lab servers, you may have to install your own private copy of gperftools. This means that the paths to those tools will differ from machine to machine, which greatly complicates creating a profile rule that will work everywhere. For this reason, we will not run your profile rule during testing; we will merely review your submitted profiling report and confirm that it corresponds to your code.
Review your results from the previous lab (lab2_add-5.png and lab2_list-4.png) for the cost (per synchronized operation) vs. the number of threads. For Compare-and-Swap and Mutex, we saw that adding threads took us from tens of ns per operation to small hundreds of ns per operation. Looking at the analogous results in Part 2, we see the (unadjusted-for-length) time per operation go from a few microseconds to tens of microseconds. For the adds, moderate contention added ~100ns to each synchronization operation; for the list operations, it added ~10us. This represents a 100x difference in the per-operation price of lock contention.
In a single-threaded implementation, time per operation is a very reasonable performance metric. But for multi-threaded implementations we are also concerned with how well we are taking advantage of the available parallelism. This is seen more clearly in the aggregate throughput (total operations per second for all threads combined). Run your list exerciser with:
You do not need to go back to your lab2a_list program to generate this data; the lab2b_list program (even after adding all the new features) should still be able to generate essentially the same results.
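For reference, here is a minimal sketch of how aggregate throughput could be computed from the counts the exerciser already tracks; the function name is illustrative, not part of any required interface:

```c
/* Hypothetical helper: converts the total operation count and the total
 * elapsed wall-clock time (in nanoseconds, as reported in the CSV output)
 * into aggregate throughput in operations per second. */
static double throughput_ops_per_sec(long operations, long run_time_ns)
{
    return (double)operations * 1e9 / (double)run_time_ns;
}
```

For example, the sample output shown later in this page (30,000 operations in 85,811,144 ns) corresponds to roughly 350,000 operations per second.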
The most obvious difference, which we already knew:
The obvious conclusions (from both the cost-per-operation graphs you produced previously, and the throughput graphs you have just produced) are:
Where do you believe most of the CPU time is spent in the 1 and 2-thread list tests ?
Why do you believe these to be the most expensive parts of the code?
Where do you believe most of the CPU time is being spent in the high-thread spin-lock tests?
Where do you believe most of the CPU time is being spent in the high-thread mutex tests?
It should be clear why the spin-lock implementation performs so badly with a large number of threads, but the mutex implementation should not have this problem. You may have some theories about why these algorithms scale so poorly, but theories are only theories; the prime directive of performance analysis is that theoretical conclusions must be confirmed by hard data.
Build your program with debug symbols, choose your execution profiling package, install it, run the spin-lock list test (1,000 iterations, 12 threads) under the profiler, and analyze the results to determine where the CPU time is being spent.
The default output from google-pprof will show you which routine is consuming most of the CPU time. If you then re-run google-pprof with the --list option (specifying that routine), it will give you a source-level breakdown of how much time is being spent on each instruction. You should get a very clear answer to the question of where the program is spending its time. Update your Makefile to run this test and generate the results automatically (make profile), include your profiling report (in a file named profile.out) in your submitted tarball, and identify it in your README file.
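If you settle on gperftools, one possible way to capture the profile is to bracket the measured region with the profiler's start/stop calls. This is a sketch, assuming gperftools is installed and the program is linked with -lprofiler; the output file name and the stand-in workload are illustrative:

```c
#include <gperftools/profiler.h>

/* Stand-in workload so this sketch compiles; in the real program this
 * would be the spin-lock list test (create threads, run, join them). */
static void run_list_test(void)
{
    volatile long x = 0;
    for (long i = 0; i < 100000000; i++)
        x += i;
}

int main(void)
{
    ProfilerStart("raw.gperf");   /* begin writing CPU samples to this file */
    run_list_test();
    ProfilerStop();               /* flush and close the profile */
    return 0;
}
```

Alternatively, gperftools can collect a profile without any source changes: run the -lprofiler-linked binary with the CPUPROFILE environment variable set to the desired output file name.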
Where (what lines of code) are consuming most of the CPU time when the spin-lock version of the list exerciser is run with a large number of threads?
Why does this operation become so expensive with large numbers of threads?
In the spin-lock case, profiling can tell us where we are spending most of our time. In the mutex case, we are not spinning: a thread that cannot get the lock is blocked and not consuming any CPU time. Profiling only tells us what code we are executing; it tells us nothing about the time we spend blocked. How could we confirm that, in the mutex case, most threads are spending most of their time waiting for a lock?
Update your mutex and spin-lock implementations to measure how long each thread waits to acquire the lock: record a high-resolution timestamp before and after each lock acquisition, and accumulate the difference into a per-thread total from which an average wait-for-lock time can be reported (a sketch follows).
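A minimal sketch of one way the mutex wait could be timed; the names are illustrative, and the spin-lock case is analogous (timing the spin loop instead):

```c
#include <pthread.h>
#include <time.h>

/* Read a monotonic high-resolution clock, in nanoseconds.  CLOCK_MONOTONIC
 * is immune to wall-clock adjustments during the run. */
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Acquire the lock and return how many nanoseconds this call spent
 * waiting; each thread accumulates the return values into its own total
 * (avoiding a shared counter, which would itself need synchronization). */
static long long timed_mutex_lock(pthread_mutex_t *m)
{
    long long start = now_ns();
    pthread_mutex_lock(m);        /* may block while another thread holds it */
    return now_ns() - start;
}
```

Summing each thread's accumulated wait after the join, and dividing by the total number of lock acquisitions, yields the average wait-for-lock time to report and plot.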
Run the list mutex test again for 1,000 iterations and 1, 2, 4, 8, 16, and 24 threads, and plot the wait-for-lock time and the average time per operation against the number of competing threads. Submit this graph as lab2b_2.png.
While the details of how contention degrades throughput are different for mutex and spin-lock synchronization, all of the degradation is the result of increased contention. This is the fundamental problem with coarse-grained synchronization. The classic solution to this problem is to partition the single resource (in this case a linked list) into multiple independent resources, and divide the requests among those sub-resources.
Add a new --lists=# option to your lab2_list program: it partitions the list into the specified number of sub-lists, with each key assigned to one of them (e.g., by a simple hash of the key, as in the sketch below).
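A minimal sketch of one way the sub-list selection could work; the hash function and names are illustrative, not required by the spec:

```c
/* Hypothetical: choose which sub-list a key belongs to.  Any deterministic
 * hash works, provided equal keys always map to the same sub-list. */
static int sublist_index(const char *key, int nlists)
{
    unsigned hash = 0;
    for (const char *p = key; *p != '\0'; p++)
        hash = hash * 31 + (unsigned char)*p;   /* simple string hash */
    return (int)(hash % (unsigned)nlists);
}
```

Because each sub-list can then have its own lock, threads working on different sub-lists no longer contend with one another; only the length computation, which must visit every sub-list, still touches all of them.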
The supported command line options and expected output are illustrated below:
% ./lab2_list --threads=10 --iterations=1000 --lists=5 --yield=id --sync=m
list-id-m,10,1000,5,30000,85811144,2860,20580
%

In this sample output the fields are: test name, number of threads (10), iterations (1000), lists (5), total operations (10 x 1000 x 3 = 30,000), total run time in ns (85,811,144), average time per operation in ns (2,860), and average wait-for-lock time in ns (20,580).
Confirm the correctness of your new implementation:
Now that we believe the partitioned lists implementation works, we can measure its performance:
QUESTION 2.3.4 - Performance of Partitioned Lists
Your README file must include lines of the form:
Your tarball should have a name of the form lab2b-studentID.tar.gz.
You can sanity check your submission with the provided test script.
Projects that do not pass the test script will not be accepted!
We will test it on a departmental Linux server. You would be well advised to test your submission on that platform before submitting it.
Points for this project will be awarded:
value | feature |
---|---|
Packaging and build (10% total) | |
2% | un-tars expected contents |
3% | clean build of correct programs w/default action (no warnings) |
3% | Makefile has working clean, dist, tests, profile and graphs targets |
2% | reasonableness of README contents |
Code Review (20%) | |
4% | overall readability and reasonableness |
4% | multi-list implementation |
4% | thread correctly sums up the length across sub-lists |
4% | mutex use on multi-lists |
4% | spin-lock use on multi-lists |
Results (40% total) ... produces correct output for | |
5% | lists |
5% | correct mutex |
5% | correct spin |
5% | reasonable time per operation reporting |
5% | reasonable wait for mutex reporting |
10% | graphs (showed what we asked for) |
5% | profiling report (clearly shows where CPU time went) |
Analysis (30% total) ... reasonably explained all results in README | |
2% | General clarity of thought and understanding |
7% | 2.3.1 where the CPU time goes |
7% | 2.3.2 profiling |
7% | 2.3.3 wait time |
7% | 2.3.4 list partitioning |
Note: if your program does not accept the correct options or produce the correct output, you are likely to receive a zero for the results portion of your grade. Look carefully at the sample commands and output. If you have questions, ask.