Lab 4. Synchronization
Project goals
This is not meant to be a difficult project, but is intended to force
you to directly engage (at a low level) with a range of synchronization
problems. The educational and skill demonstration goals of this
project are:
  - primary: demonstrate the ability to recognize critical sections
    and address them with a variety of different mechanisms.

  - primary: demonstrate the existence of the problems and efficacy of
    the subsequent solutions.

  - secondary: demonstrate the ability to deliver code to meet CLI
    and API specifications.

  - secondary: experience with basic performance measurement and
    instrumentation.

  - secondary: experience with application development, exploiting
    new library functions, creating command line options to control
    non-trivial behavior.
  
Assignment overview
  The assignment is divided into three general parts and an optional
  design problem.
  - conflicting read-modify-write operations on a single variable.

  - conflicting search and update operations in an ordered doubly
    linked list.

  - understanding the crux issue in sleep/wakeup races.

  - (design problem) correct a significant measurement artifact.
 
  In this assignment, you will:
  - implement a few basic operations on shared variables and complex
    data structures.

  - implement test programs to exercise those operations.

  - demonstrate race conditions resulting from unprotected parallel
    execution.

  - implement and demonstrate the effectiveness of a few different
    serialization mechanisms.

  - measure the relative performance of these implementations as a
    function of the number of competing threads and operations, and
    explain your results.

  - explain the correct handling of a very important critical
    section that was only briefly covered in the reading.
 
  Your deliverables for this assignment will include:
- object modules that build cleanly and implement specified APIs
 
- programs that build cleanly and implement specified CLIs
 
- executions of those programs to demonstrate specified problems and
solutions
 
- timing results and graphs
 
- brief (1 paragraph) explanations of those results
 
To perform this assignment, you will need to learn to exploit a few new tools:
- clock_gettime(2) … high resolution timers

- GCC atomic builtins

- gnuplot(1) … you may also find it useful to learn to use
  this versatile package for producing your graphs.
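For example, an elapsed time in nanoseconds can be computed from two
clock_gettime(2) samples.  The sketch below is only an illustration:
the choice of CLOCK_MONOTONIC is an assumption (any suitable high
resolution clock will do), and older glibc versions may need to link
with -lrt.

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        /* ... the work being timed goes here ... */
        clock_gettime(CLOCK_MONOTONIC, &end);

        /* combine the two samples into a single elapsed-nanoseconds value */
        long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                     + (end.tv_nsec - start.tv_nsec);
        printf("elapsed time: %lld ns\n", ns);
        return 0;
    }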
 
Part 1 – parallel updates to a shared variable
Summary of Deliverables
  - the source for a C program and Makefile that cleanly (no warnings)
    builds using gcc on an x86-64 GNU/Linux system, implements the (below)
    specified command line options, drives one or more parallel threads
    that do adds and subtracts to a shared variable, and reports on the
    final sum and performance.

  - graphs of:

      - the average time per operation vs the number of iterations

      - the average time per operation vs the number of threads for
        all four versions of the add function.

  - brief written (a few sentences per question) answers to
    questions 1.1, 1.2, and 1.3.
 
Detailed instructions
Start with a basic add routine:
    void add(long long *pointer, long long value) {
        long long sum = *pointer + value;
        *pointer = sum;
    }
Write a test driver program called addtest that:
  - takes a parameter for the number of parallel threads
    (--threads=#, default 1)

  - takes a parameter for the number of iterations
    (--iterations=#, default 1)

  - initializes a (long long) counter to zero.

  - notes the (high resolution) starting time for the run (using
    clock_gettime(2))

  - starts the specified number of threads, each of which will use
    the above add function to:

      - add 1 to the counter the specified number of times

      - add −1 to the counter the specified number of times

      - exit to re-join the parent thread

  - waits for all threads to complete, and notes the (high
    resolution) ending time for the run.

  - checks to see if the counter value is zero, and if not logs an
    error message to stderr.

  - prints to stdout

      - the total number of operations performed

      - the total run time (in nanoseconds), and the average time per
        operation (in nanoseconds).

  - exits with a status of zero if there were no errors, otherwise
    non-zero.
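A minimal sketch of how these pieces might fit together is shown
below.  It is not a complete solution: option parsing, error checking,
and some of the required output are omitted, and the global names
(counter, nthreads, niterations, worker) are illustrative assumptions,
not part of the specification.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    long long counter = 0;              /* the shared counter */
    long nthreads = 1, niterations = 1; /* set from --threads= / --iterations= */

    /* the basic add routine from above */
    void add(long long *pointer, long long value) {
        long long sum = *pointer + value;
        *pointer = sum;
    }

    /* each thread adds 1 and then -1 the requested number of times */
    void *worker(void *arg) {
        for (long i = 0; i < niterations; i++)
            add(&counter, 1);
        for (long i = 0; i < niterations; i++)
            add(&counter, -1);
        return NULL;
    }

    int main(int argc, char **argv) {
        /* ... parse --threads= and --iterations= here ... */
        pthread_t *tid = malloc(nthreads * sizeof(*tid));
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (long t = 0; t < nthreads; t++)
            pthread_create(&tid[t], NULL, worker, NULL);
        for (long t = 0; t < nthreads; t++)
            pthread_join(tid[t], NULL);
        clock_gettime(CLOCK_MONOTONIC, &end);

        long long ops = (long long) nthreads * niterations * 2;
        long long ns  = (end.tv_sec - start.tv_sec) * 1000000000LL
                      + (end.tv_nsec - start.tv_nsec);
        if (counter != 0)
            fprintf(stderr, "ERROR: final count = %lld\n", counter);
        printf("elapsed time: %lld ns\nper operation: %lld ns\n", ns, ns / ops);
        free(tid);
        return counter == 0 ? 0 : 1;
    }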
 
suggested sample output:
  $ ./addtest --iter=10000 --threads=10
  10 threads x 10000 iterations x (add + subtract) = 200000 operations
  ERROR: final count = 374
  elapsed time: 6574000 ns
  per operation: 2 ns
Run your program for a range of threads and iterations values, and
note how many threads and iterations it takes to (fairly consistently)
result in a failure (non-zero sum).
QUESTIONS 1.1:
  - Why does it take this many threads or iterations?
 
  - Why does a significantly smaller number of iterations so seldom
  fail?
 
Extend the basic add routine to make it easier to cause the failures:
    int opt_yield;
    void add(long long *pointer, long long value) {
        long long sum = *pointer + value;
        if (opt_yield)
            pthread_yield();
        *pointer = sum;
    }
And add a new --yield=1 option to your driver program
that sets opt_yield to 1. Re-run your tests and see how many
iterations and threads it takes to (fairly consistently) result in a
failure.  It should now be much easier to cause the failures.
Compare the average execution time of the yield and non-yield
versions with different numbers of threads and iterations.  You should
note that the --yield runs are much slower than the
non-yield runs … even with a single thread.
Graph the average cost per operation (non-yield) as a function of
the number of iterations, with a single thread.  You should note that
the average cost per operation goes down as the number of iterations
goes up.
QUESTIONS 1.2
  - Why does the average cost per operation drop with increasing
  iterations?
 
  - How do we know what the “correct” cost is?
 
  - Why are the --yield runs so much slower?  Where is the extra
  time going?
 
  - Can we get valid timings if we are using --yield?  How, or why not?
 
Implement three new versions of the add function:
  - one protected by a pthread_mutex, enabled by a new --sync=m option.

  - one protected by a spin-lock, enabled by a new --sync=s option.
    You will have to implement your own spin-lock operation.  We suggest
    that you do this using the GCC atomic __sync_ builtin functions
    __sync_lock_test_and_set and __sync_lock_release.

  - one that performs the add using the GCC atomic __sync_ builtin
    function __sync_val_compare_and_swap to ensure atomic updates to the
    shared counter, enabled by a new --sync=c option.
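One possible shape for the spin-lock and compare-and-swap versions is
sketched below.  The lock variable and function names are assumptions
(your driver may dispatch to these versions however you like), and
opt_yield is the flag introduced above.

    #include <pthread.h>

    extern int opt_yield;               /* the --yield flag from above */
    static volatile int lock = 0;       /* assumed spin-lock word, 0 = free */

    /* --sync=s: spin until the test-and-set returns the old value 0 */
    void add_spin(long long *pointer, long long value) {
        while (__sync_lock_test_and_set(&lock, 1))
            ;                           /* someone else holds the lock; keep spinning */
        long long sum = *pointer + value;
        if (opt_yield)
            pthread_yield();
        *pointer = sum;
        __sync_lock_release(&lock);
    }

    /* --sync=c: recompute and retry until the compare-and-swap succeeds */
    void add_cas(long long *pointer, long long value) {
        long long old, new;
        do {
            old = *pointer;
            new = old + value;
            if (opt_yield)
                pthread_yield();
        } while (__sync_val_compare_and_swap(pointer, old, new) != old);
    }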
 
Use your yield option to confirm that (even for large numbers of
threads and iterations) all three of these serialization mechanisms
eliminate the race conditions in the add critical section.
Using a large enough number of iterations to overcome the issues
raised in the previous section, note the average time per operation
for a range of threads= values, and graph the performance
of all four versions (unprotected, mutex, spin-lock, compare-and-swap)
vs the number of threads.
QUESTIONS 1.3
  - Why do all of the options perform similarly for low numbers of
  threads?
 
  - Why do the three protected operations slow down as the number of
  threads rises?
 
  - Why are spin-locks so expensive for large numbers of threads?
 
Part 2 – parallel updates to complex data structures
Summary of Deliverables
  - the source for a C module and Makefile that cleanly (no
  warnings) builds using gcc on an x86-64 GNU/Linux system, and
  implements insert, delete, lookup, and length methods for a sorted
  doubly linked list (described in a provided header file, including
  correct placement of pthread_yield calls).
 
  - the source for a C program and Makefile that cleanly (no
  warnings) builds using gcc on an x86-64 GNU/Linux system, implements
  the (below) specified command line options, drives one or more
  parallel threads that do operations on a shared linked list, and
  reports on the final list and performance.
 
  - graphs of:

      - average time per unprotected operation vs number of iterations
        (single thread)

      - (corrected) average time per operation (for none, mutex, and
        spin-lock) vs number of threads.

      - (corrected) average time per operation (for none, mutex, and
        spin-lock) vs the ratio of threads per list.

  - brief written (a few sentences per question) answers to
    questions 2.1, 2.2, and 2.3.
 
Detailed instructions
Review the interface specifications for a sorted doubly linked list
package described in the header file SortedList.h, and
implement all four of those methods.  Note that the interface includes
three software-controlled yield options.  Identify the critical
section in each of your four methods, and add calls to pthread_yield,
controlled by the yield options:
  - in SortedList_insert if opt_yield & INSERT_YIELD

  - in SortedList_delete if opt_yield & DELETE_YIELD

  - in SortedList_lookup if opt_yield & SEARCH_YIELD

  - in SortedList_length if opt_yield & SEARCH_YIELD
These force a switch to another thread at the critical point in
each method.
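As an illustration of where such a yield belongs, here is one possible
sketch of SortedList_insert.  It assumes a circular list with a
sentinel head and prev/next/key fields; the actual declarations, and
the exact insert semantics, come from the provided SortedList.h, so
treat this only as an example of yield placement inside the critical
section, not as the required implementation.

    #include <string.h>
    #include <pthread.h>
    #include "SortedList.h"   /* provided header: list types, opt_yield, yield flags */

    void SortedList_insert(SortedList_t *list, SortedListElement_t *element) {
        SortedListElement_t *prev = list;
        SortedListElement_t *next = list->next;

        /* find the first element whose key is >= the new key */
        while (next != list && strcmp(element->key, next->key) > 0) {
            prev = next;
            next = next->next;
        }

        /* critical section: the neighbors have been chosen but the
         * pointers have not yet been updated */
        if (opt_yield & INSERT_YIELD)
            pthread_yield();

        element->prev = prev;
        element->next = next;
        prev->next = element;
        next->prev = element;
    }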
Write a test driver program (you can start with your driver from
part 1) called sltest that:
  - takes a parameter for the number of parallel threads
    (--threads=#, default 1)

  - takes a parameter for the number of iterations
    (--iterations=#, default 1)

  - takes a parameter to enable the optional critical section yields
    (--yield=[ids], i for insert, d for delete, and s for searches)
  - initializes an empty list.
 
  - creates and initializes (with random keys) the required number
  (threads × iterations) of list elements.  We do this before
  creating the threads so that this time is not included in our
  start-to-finish measurement.
 
  - notes the (high resolution) starting time for the run (using
  clock_gettime(2))
 
  - starts the specified number of threads, each of which will use
    the above list functions to:

      - insert each of its elements (iterations parameter) into the
        list

      - get the list length

      - look up each of the keys it inserted

      - delete each returned element from the list

      - exit to re-join the parent thread

  - waits for all threads to complete, and notes the (high
    resolution) ending time for the run.
 
  - checks the length of the list to confirm that it is zero, and
  logs an error to stderr if it is not.
 
  - prints to stdout
  
    - the number of operations performed
 
    - the total run time (in nanoseconds), and the average time per
    operation (in nanoseconds).
 
  
 
  - exits with a status of zero if there were no errors, otherwise
  non-zero
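The per-thread work might be organized along the lines of the sketch
below.  The global names (list, niterations), the element fields, and
the partitioning of elements among threads are assumptions, and the
error-return conventions checked here are also assumptions about what
SortedList.h specifies.

    #include <stdio.h>
    #include "SortedList.h"

    extern SortedList_t *list;          /* the shared list head */
    extern long niterations;            /* from --iterations= */

    /* arg points at this thread's pre-allocated slice of elements */
    void *list_worker(void *arg) {
        SortedListElement_t *mine = arg;

        for (long i = 0; i < niterations; i++)
            SortedList_insert(list, &mine[i]);

        if (SortedList_length(list) < 0)        /* assumed: negative means corrupted */
            fprintf(stderr, "ERROR: corrupted list\n");

        for (long i = 0; i < niterations; i++) {
            SortedListElement_t *e = SortedList_lookup(list, mine[i].key);
            if (e == NULL || SortedList_delete(e) != 0)
                fprintf(stderr, "ERROR: lookup/delete failed\n");
        }
        return NULL;
    }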
 
Suggested sample output:
    $ ./sltest --threads=10 --iterations=1000 --lists=10 --sync=m
    10 threads x 100 iterations x (ins + lookup/del) x (100/2 avg len) = 100000 operations
    elapsed time: 4225760 ns
    per operation: 4 ns
Run your program with a single thread, and increasing numbers of
iterations, and note the average time per operation.  These results
should be quite different from what you observed when testing your add
function with increasing numbers of iterations.  Graph the time per
operation vs the number of iterations (for --threads=1).
QUESTIONS 2.1
  - Explain the variation in time per operation vs the number of
    iterations.  How would you propose to correct for this effect?
 
Run your program and see how many parallel threads and iterations
it takes to fairly consistently demonstrate a problem.  Note that even
if you check for most inconsistencies in the list, your program may
still experience segmentation faults when running multi-threaded
without synchronization.  Then run it again using various combinations
of yield options and see how many threads and iterations it takes to
fairly consistently demonstrate the problem.  Make sure that you can
demonstrate:
  - conflicts between inserts (--yield=i)

  - conflicts between deletes (--yield=d)

  - conflicts between inserts and lookups (--yield=is)

  - conflicts between deletes and lookups (--yield=ds)
Add two new options to your program to call two new versions of
these methods: one set of operations protected by pthread_mutexes
(--sync=m), and another protected by test-and-set spin
locks (--sync=s).  Using your --yield
options, demonstrate that either of these protections seems to
eliminate all of the problems, even for large numbers of threads and
iterations.
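Note that this protection lives in the test driver, not in the list
code itself.  One sketch of the mutex case is shown below; the lock
name and wrapper function are illustrative assumptions, not
requirements, and the spin-lock case is analogous, using the
test-and-set lock from Part 1 in place of the mutex.

    #include <pthread.h>
    #include "SortedList.h"

    extern SortedList_t *list;
    static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

    /* --sync=m: hold the (single) list mutex across each list operation */
    void insert_protected(SortedListElement_t *element) {
        pthread_mutex_lock(&list_lock);
        SortedList_insert(list, element);
        pthread_mutex_unlock(&list_lock);
    }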
Rerun your program without the yields, and choose an appropriate
number of iterations (or apply the correction you identified in
response to question 2.1).  Note that you will only be able to run the
unprotected method for a single thread.  Note the execution times for
the original and both new protected methods.  Graph the (corrected)
per operation times (for each of the three synchronization options:
unprotected, mutex, spin) vs the number of threads.
QUESTIONS 2.2
  - Compare the variation in time per protected operation vs the
  number of threads in Part 2 and in Part 1.  Explain the
  difference.
 
Add a new --lists=# option to your test driver
program:
  - break the single (huge) sorted list into the specified number of
    sub-lists (each with its own list header and synchronization
    object).
 
  - change your test driver to select which sub-list a particular
    key should be in based on a simple hash of the key, modulo the
    number of lists (a sketch follows this list).
 
  - figure out how to (safely and correctly) re-implement the length
  function, which now needs to enumerate all of the sub-lists.
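For the sub-list selection described above, any reasonable hash will
do.  The sketch below is one arbitrary choice; the function name and
the hash itself are assumptions, not part of the specification.

    /* map a key to one of nlists sub-lists */
    static int which_list(const char *key, int nlists) {
        unsigned long h = 0;
        for (const char *p = key; *p != '\0'; p++)
            h = h * 31 + (unsigned char) *p;
        return (int) (h % nlists);
    }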
 
Rerun all three versions (unprotected, spin, mutex) of your program
(without yields) for a range of --lists= values.  Note
that you will only be able to run the unprotected version for a single
thread.  Graph the per operation times (for each of the three
synchronization options) vs the ratio of threads to lists.
QUESTIONS 2.3
  - Explain the change in performance of the synchronized
    methods as a function of the number of threads per list.
 
  - Explain why threads per list is a more interesting number than
  threads (for this particular measurement).
 
Part 3 – sleep/wakeup races
Summary of Deliverables
  - brief (a few sentences per question) answers to questions
  3-1
 
The pthread_cond_wait operation takes a mutex as a
parameter: it atomically releases that mutex as it puts the calling
thread to sleep, and reacquires it before allowing the thread to
resume.
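For reference, the canonical use of this primitive looks like the
sketch below (the shared state, mutex, and condition variable names
are placeholders); the questions that follow ask why each step of this
pattern is necessary.

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    static int ready = 0;               /* the shared condition, protected by m */

    void consumer(void) {
        pthread_mutex_lock(&m);
        while (!ready)                  /* re-check: wakeups may be spurious or stale */
            pthread_cond_wait(&c, &m);  /* atomically releases m, sleeps, reacquires m */
        /* ... consume the event, still holding m ... */
        pthread_mutex_unlock(&m);
    }

    void producer(void) {
        pthread_mutex_lock(&m);
        ready = 1;                      /* update the shared state under the mutex */
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
    }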
QUESTIONS 3-1
  - Why must the mutex be held when pthread_cond_wait is
  called?
 
  - Why must the mutex be released when the waiting thread is
  blocked?
 
  - Why must the mutex be reacquired when the calling thread
  resumes?
 
  - Why must this be done inside of pthread_cond_wait?  Why can’t
  the caller simply release the mutex before calling
  pthread_cond_wait?
 
  - Can this be done in a user-mode implementation of
    pthread_cond_wait?  If so, how?  If it can only be implemented by a
    system call, explain why.
 
Part 4 – Design Problem – Artifact Correction
At the beginning of Questions 1.2 you explored a measurement
artifact: an error in our time per operation computation that was
introduced by the way in which we implemented the test driver.  In
Part 1 you used a large number of iterations to dilute the artifact
per operation.  That isn’t a bad approach, but a better approach would
be to eliminate the artifact (avoid measuring it, or find a way to
subtract it from our results).
Can you design and implement a change to the measurement program to
eliminate this artifact?  Having done so, what data can you generate
to testify to the elimination (or at least substantial reduction) of
that artifact?
If you choose to do this:
  - add a Part 4 in which you explain your approach.
 
  - add a --correct switch to your addtest and sltest programs that
    enables your artifact correction.
  - We will review your design and test the effectiveness of your
  correction.
 
 © 2016 Mark Kampe. All rights reserved.
 See copying rules.