Lab 4. Synchronization

Project goals

This is not meant to be a difficult project, but it is intended to force you to directly engage (at a low level) with a range of synchronization problems. The educational and skill demonstration goals of this project are:

Assignment overview

The assignment is divided into three general parts and an optional design problem.

  1. conflicting read-modify-write operations on a single variable.
  2. conflicting search and update operations in an ordered doubly linked list.
  3. understanding the crux issue in sleep/wakeup races.
  4. (design problem) correcting a significant measurement artifact.

In this assignment, you will:

Your deliverables for this assignment will include:

To perform this assignment, you will need to learn to exploit a few new tools:

Part 1 – parallel updates to a shared variable

Summary of Deliverables

Detailed instructions

Start with a basic add routine:

    void add(long long *pointer, long long value) {
        long long sum = *pointer + value;   /* read and compute */
        *pointer = sum;                     /* write back: not atomic with the read */
    }

Write a test driver program called addtest that:

Suggested sample output:

  $ ./addtest --iter=10000 --threads=10
  10 threads x 10000 iterations x (add + subtract) = 200000 operations
  ERROR: final count = 374
  elapsed time: 6574000 ns
  per operation: 32 ns
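A minimal sketch of what such a driver might look like, assuming positional stand-ins for the --threads and --iter options and CLOCK_MONOTONIC timing (the option handling and variable names here are illustrative, not prescribed by the assignment):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static long long counter;     /* the shared variable under test */
    static int niter;

    void add(long long *pointer, long long value) {
        long long sum = *pointer + value;
        *pointer = sum;
    }

    /* each thread adds 1 niter times, then subtracts 1 niter times */
    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < niter; i++)
            add(&counter, 1);
        for (int i = 0; i < niter; i++)
            add(&counter, -1);
        return NULL;
    }

    int main(int argc, char **argv) {
        int nthreads = (argc > 1) ? atoi(argv[1]) : 10;   /* stand-in for --threads */
        niter = (argc > 2) ? atoi(argv[2]) : 10000;       /* stand-in for --iter */

        pthread_t tid[nthreads];
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int t = 0; t < nthreads; t++)
            pthread_create(&tid[t], NULL, worker, NULL);
        for (int t = 0; t < nthreads; t++)
            pthread_join(tid[t], NULL);
        clock_gettime(CLOCK_MONOTONIC, &end);

        long long ops = 2LL * nthreads * niter;
        long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                     + (end.tv_nsec - start.tv_nsec);

        printf("%d threads x %d iterations x (add + subtract) = %lld operations\n",
               nthreads, niter, ops);
        if (counter != 0)
            printf("ERROR: final count = %lld\n", counter);
        printf("elapsed time: %lld ns\n", ns);
        printf("per operation: %lld ns\n", ns / ops);
        return 0;
    }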

Run your program for a range of thread and iteration counts, and note how many threads and iterations it takes to (fairly consistently) result in a failure (non-zero sum).

QUESTIONS 1.1

  1. Why does it take this many threads or iterations?
  2. Why does a significantly smaller number of iterations so seldom fail?

Extend the basic add routine to make it easier to cause the failures:

    int opt_yield;

    void add(long long *pointer, long long value) {
        long long sum = *pointer + value;
        if (opt_yield)
            pthread_yield();    /* surrender the CPU mid-update, inside the critical section */
        *pointer = sum;
    }

Add a new --yield=1 option to your driver program that sets opt_yield to 1. Re-run your tests and see how many iterations and threads it takes to (fairly consistently) result in a failure. It should now be much easier to cause the failures.

Compare the average execution time of the yield and non-yield versions with different numbers of threads and iterations. You should note that the --yield runs are much slower than the non-yield runs … even with a single thread.

Graph the average cost per operation (non-yield) as a function of the number of iterations, with a single thread. You should note that the average cost per operation goes down as the number of iterations goes up.

QUESTIONS 1.2

  1. Why does the average cost per operation drop with increasing iterations?
  2. How do we know what the “correct” cost is?
  3. Why are the --yield runs so much slower? Where is the extra time going?
  4. Can we get valid timings if we are using --yield? How, or why not?

Implement three new versions of the add function: one protected by a pthread_mutex, one protected by a test-and-set spin lock, and one that performs the update atomically with compare-and-swap. See the sketches below.
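Sketches of what the three protected versions might look like, built on GCC's __sync_lock_test_and_set / __sync_lock_release and __sync_val_compare_and_swap builtins (the function names and lock variables are illustrative, not prescribed):

    #include <pthread.h>

    static pthread_mutex_t add_mutex = PTHREAD_MUTEX_INITIALIZER;
    static int spin_lock;          /* 0 = free, 1 = held */

    /* mutex-protected version */
    void add_m(long long *pointer, long long value) {
        pthread_mutex_lock(&add_mutex);
        *pointer += value;
        pthread_mutex_unlock(&add_mutex);
    }

    /* test-and-set spin lock, built from GCC atomic builtins */
    void add_s(long long *pointer, long long value) {
        while (__sync_lock_test_and_set(&spin_lock, 1))
            ;                      /* busy-wait until the lock is free */
        *pointer += value;
        __sync_lock_release(&spin_lock);
    }

    /* lock-free version: retry until the value is unchanged at the swap */
    void add_c(long long *pointer, long long value) {
        long long old, sum;
        do {
            old = *pointer;
            sum = old + value;
        } while (__sync_val_compare_and_swap(pointer, old, sum) != old);
    }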

Use your yield option to confirm that (even for large numbers of threads and iterations) all three of these serialization mechanisms eliminate the race conditions in the add critical section.

Using a large enough number of iterations to overcome the issues raised in the previous section, note the average time per operation for a range of --threads= values, and graph the performance of all four versions (unprotected, mutex, spin-lock, compare-and-swap) vs the number of threads.

QUESTIONS 1.3

  1. Why do all of the options perform similarly for low numbers of threads?
  2. Why do the three protected operations slow down as the number of threads rises?
  3. Why are spin-locks so expensive for large numbers of threads?

Part 2 – parallel updates to complex data structures

Summary of Deliverables

Detailed instructions

Review the interface specifications for a sorted doubly linked list package, described in the header file SortedList.h, and implement all four of those methods. Note that the interface includes three software-controlled yield options. Identify the critical section in each of your four methods, and add calls to pthread_yield, controlled by the yield options:

These force a switch to another thread at the critical point in each method.
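As one illustration, here is a sketch of an insert method with an injected yield; the element type, flag layout, and list conventions (a circular list with a self-linked sentinel head) are assumptions for the example, not the actual SortedList.h declarations:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <string.h>

    typedef struct SortedListElement {
        struct SortedListElement *prev;
        struct SortedListElement *next;
        const char *key;
    } SortedListElement_t;

    extern int opt_yield;          /* bit flags selecting which methods yield */
    #define INSERT_YIELD 0x01      /* illustrative flag value */

    void SortedList_insert(SortedListElement_t *list, SortedListElement_t *element) {
        SortedListElement_t *p = list->next;

        /* walk the (circular, sentinel-headed) list to find the first
         * element whose key is >= the new element's key */
        while (p != list && strcmp(p->key, element->key) < 0)
            p = p->next;

        if (opt_yield & INSERT_YIELD)
            pthread_yield();       /* yield between the search and the splice */

        /* splice the new element in ahead of p */
        element->next = p;
        element->prev = p->prev;
        p->prev->next = element;
        p->prev = element;
    }

The yield lands between deciding where the element belongs and updating the pointers, which is exactly the window in which a concurrent update can corrupt the list.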

Write a test driver program (you can start with your driver from part 1) called sltest that:

Suggested sample output:

    $ ./sltest --threads=10 --iterations=100 --lists=10 --sync=m
    10 threads x 100 iterations x (ins + lookup/del) x (100/2 avg len) = 100000 operations
    elapsed time: 4225760 ns
    per operation: 42 ns

Run your program with a single thread and increasing numbers of iterations, and note the average time per operation. These results should be quite different from what you observed when testing your add function with increasing numbers of iterations. Graph the time per operation vs the number of iterations (with --threads=1).

QUESTIONS 2.1

Run your program and see how many parallel threads and iterations it takes to fairly consistently demonstrate a problem. Note that even if you check for most inconsistencies in the list, your program may still experience segmentation faults when running multi-threaded without synchronization. Then run it again using various combinations of yield options and see how many threads and iterations it takes to fairly consistently demonstrate the problem. Make sure that you can demonstrate:

Add two new options to your program to call two new versions of these methods: one set of operations protected by pthread_mutexes (--sync=m), and another protected by test-and-set spin locks (--sync=s). Using your --yield options, demonstrate that each of these protections appears to eliminate all of the problems, even for large numbers of threads and iterations.
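A sketch of how the --sync=m path might look, assuming the driver (rather than the list package) serializes whole list operations; the wrapper and names are illustrative:

    #include <pthread.h>

    /* assumed to match your SortedList implementation */
    typedef struct SortedListElement SortedListElement_t;
    extern void SortedList_insert(SortedListElement_t *list, SortedListElement_t *element);

    static pthread_mutex_t list_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* worker threads call this instead of SortedList_insert under --sync=m;
     * one mutex serializes every operation on the (single) list */
    void protected_insert(SortedListElement_t *list, SortedListElement_t *element) {
        pthread_mutex_lock(&list_mutex);
        SortedList_insert(list, element);
        pthread_mutex_unlock(&list_mutex);
    }

The --sync=s wrappers have the same shape, with the __sync_lock_test_and_set loop from part 1 in place of the mutex calls.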

Rerun your program without the yields, and choose an appropriate number of iterations (or apply the correction you identified in response to question 2.1). Note that you will only be able to run the unprotected method for a single thread. Note the execution times for the original and both new protected methods. Graph the (corrected) per operation times (for each of the three synchronization options: unprotected, mutex, spin) vs the number of threads.

QUESTIONS 2.2

Add a new --lists=# option to your test driver program:

Rerun all three versions (unprotected, spin, mutex) of your program (without yields) for a range of --lists= values. Note that you will only be able to run the unprotected version for a single thread. Graph the per operation times (for each of the three synchronization options) vs the ratio of threads to lists.
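One possible way to spread keys across sublists, each with its own lock; the hash function and structures below are assumptions for illustration, not part of the specification:

    #include <pthread.h>

    typedef struct SortedListElement SortedListElement_t;   /* as in SortedList.h */

    /* each sublist carries its own lock, so threads working on
     * different sublists never contend */
    struct sublist {
        SortedListElement_t *head;     /* this sublist's list head */
        pthread_mutex_t lock;
    };

    static struct sublist *lists;      /* allocated from the --lists=# value */
    static int nlists;

    /* map a key to the sublist that owns it; any reasonable hash will do */
    static int hash_key(const char *key) {
        unsigned h = 0;
        while (*key)
            h = h * 31 + (unsigned char)*key++;
        return (int)(h % nlists);
    }

An insert, lookup, or delete then locks only the sublist that hash_key selects, while an operation over the whole list (such as a length check) has to visit every sublist.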

QUESTIONS 2.3

Part 3 – sleep/wakeup races

Summary of Deliverables

The pthread_cond_wait operation takes a mutex as a parameter; it atomically releases that mutex as it blocks the calling thread, and reacquires it before allowing the thread to resume.
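A minimal illustration of those semantics (the flag and function names are invented for the example):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static bool ready;          /* the shared condition, protected by m */

    /* sleeping side: can never miss a wakeup, because the condition is
     * tested and the wait begun while holding m */
    void wait_for_ready(void) {
        pthread_mutex_lock(&m);
        while (!ready)                      /* loop: wakeups may be spurious */
            pthread_cond_wait(&cond, &m);   /* atomically releases m and blocks;
                                               m is held again when it returns */
        pthread_mutex_unlock(&m);
    }

    /* waking side: makes the condition true, then signals */
    void make_ready(void) {
        pthread_mutex_lock(&m);
        ready = true;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&m);
    }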

QUESTIONS 3.1

Part 4 – Design Problem – Artifact Correction

At the beginning of Questions 1.2 you explored a measurement artifact: an error in our time-per-operation computation that was introduced by the way in which we implemented the test driver. In part 1 you used a large number of iterations to dilute the artifact's contribution per operation. That isn't a bad approach, but a better one would be to eliminate the artifact (avoid measuring it, or find a way to subtract it from our results).

Can you design and implement a change to the measurement program to eliminate this artifact? Having done so, what data can you generate to testify to the elimination (or at least substantial reduction) of that artifact?
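One possible approach, offered as a sketch rather than the required design: have each thread time only its own work loop, so that thread creation and join overhead is never included in the measurement in the first place (the names below match the illustrative driver from part 1):

    #include <pthread.h>
    #include <time.h>

    extern void add(long long *pointer, long long value);
    extern long long counter;
    extern int niter;

    /* each worker times only its own loop; main() later sums the
     * per-thread times and divides by the total operation count */
    void *timed_worker(void *arg) {
        long long *my_ns = arg;               /* per-thread result cell */
        struct timespec s, e;

        clock_gettime(CLOCK_MONOTONIC, &s);   /* start after thread creation */
        for (int i = 0; i < niter; i++)
            add(&counter, 1);
        for (int i = 0; i < niter; i++)
            add(&counter, -1);
        clock_gettime(CLOCK_MONOTONIC, &e);   /* stop before the join */

        *my_ns = (e.tv_sec - s.tv_sec) * 1000000000LL
               + (e.tv_nsec - s.tv_nsec);
        return NULL;
    }

If the artifact has truly been eliminated, rerunning the part-1 single-thread sweep should show a per-operation cost that stays (nearly) flat as the iteration count grows, instead of falling toward an asymptote.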

If you choose to do this:


© 2016 Mark Kampe. All rights reserved. See copying rules.
$Id: lab4.html,v 1.3 2016/03/02 19:38:11 eggert Exp $