Since threads share a memory space, it is quite easy to write buggy code if memory-access precautions are not taken. Neither the system nor the compiler checks how data is being read and written, so race conditions arise whenever one thread reads a value while another is writing to it.
We are given the task of designing a banking system for Bill Gates to use. He can deposit or withdraw money at any time he wishes, so long as he has the proper funds to do so. Here is our initial (and naive) approach:
// Money stored in pennies.
// This should be large enough to encompass Bill's fortune.
long long int balance = 0;

void deposit(long long int amt) {
    balance += amt;
}

bool withdraw(long long int amt) {
    if (amt <= balance) {
        balance -= amt;
        return true;
    }
    return false;
}
On a single thread this code runs fine. However, suppose that Bill tries to deposit $100 (or 10000 pennies) into his account at an ATM while his wife tries to withdraw $50 from another machine.
Depending on the order in which the two threads execute, we can observe unexpected results: if, for example, the deposit thread adds the $100 to the balance while the withdraw thread is removing $50, the deposit can be erased altogether. Looking at the assembly code, we can see that the compiler moves the value of balance into a register, subtracts amt, and stores the result back. The increased value from the deposit thread has been completely blown away.
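To see the lost update concretely, here is a hand-simulated sketch of one bad interleaving, with the two threads' registers modeled as local variables (the starting balance of $50 is an assumption made here so that the withdrawal's funds check passes):

long long int balance = 5000;        // $50, in pennies

void lost_update(void) {
    long long int dep_reg, wd_reg;   // the two threads' "registers"

    dep_reg = balance;    // deposit thread loads 5000
    wd_reg  = balance;    // withdraw thread loads 5000 (funds check passed)
    dep_reg += 10000;     // deposit computes 15000
    balance  = dep_reg;   // deposit stores 15000
    wd_reg  -= 5000;      // withdraw computes 0
    balance  = wd_reg;    // withdraw stores 0: the $100 deposit is gone
}

Each thread's sequence is correct in isolation; only the interleaving of the loads and stores destroys the deposit.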
We can use the volatile keyword to tell the compiler not to optimize accesses to balance and to reload its value before any computation.
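A minimal sketch of that declaration:

// volatile forces the compiler to re-read balance from memory
// before every use, rather than caching it in a register.
volatile long long int balance = 0;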
This does not fix the issue, as the machine cannot perform a computation on a value without first loading it into a register anyway. An interrupt can occur at any moment, even on single-core machines, and another thread can update the value before the first thread has a chance to finish.
One approach to our problem would be to introduce machine instructions that mark the beginning and end of critical sections, or portions of code during which the executing thread must not be interrupted or context-switched away. Several problems arise from this approach, however: such instructions must be privileged, since a buggy or malicious program could otherwise disable interrupts and monopolize the CPU, and on a multicore machine disabling interrupts on one core does nothing to stop the threads running on the other cores.
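For concreteness, on x86 the real instructions cli and sti disable and re-enable interrupts; here is a sketch of deposit using them (this only works in kernel mode, and only protects a single core):

void deposit(long long int amt) {
    asm volatile ("cli");   // begin critical section: disable interrupts
    balance += amt;
    asm volatile ("sti");   // end critical section: re-enable interrupts
}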
We define our own pipes, which will be used internally in our application, unlike the pipes that reside in kernel space and operate between processes. Our pipes will work as circular buffers and are limited to single-byte I/O. Here is our initial implementation:
#include <stdbool.h>

enum { N = 1024 };

// R is the total number of bytes read
// W is the total number of bytes written
// 0 <= R <= W
// r and w are R and W modulo 2^32 (unsigned wraparound)
struct pipe {
    char buf[N];
    unsigned r, w;
};

bool pipe_empty(struct pipe *p) {
    return p->w == p->r;
}

// C black magic ensures that unsigned subtraction,
// when one number has overflowed, will also overflow,
// giving the correct answer.
bool pipe_full(struct pipe *p) {
    return p->w - p->r == N;
}

void writec(struct pipe *p, char c) {
    while (pipe_full(p))
        continue;
    p->buf[p->w++ % N] = c;
}

// readc could (if the pipe is empty) wait or return -1.
// In this implementation we will wait.
char readc(struct pipe *p) {
    while (pipe_empty(p))
        continue;
    return p->buf[p->r++ % N];
}
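For reference, a minimal single-threaded use of this interface:

static struct pipe p;        // zero-initialized: r == w == 0, so empty

int main(void) {
    writec(&p, 'x');
    char c = readc(&p);      // c == 'x'
    return c == 'x' ? 0 : 1;
}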
It is easy to see that this code is vulnerable to race conditions. If, for example, there is a single byte of space left in the pipe and two threads simultaneously decide to write a byte, one write will clobber the other's data. To deal with this, we'll introduce a way of locking code through some primitives, whose implementation we will examine later:
// Rules of engagement:
// 1. It is the caller's responsibility to perform
//    a small task after taking a lock, and to promptly unlock after.
// 2. Do not lock something that's already been locked.
// 3. Do not unlock a resource that hasn't been locked.
// Otherwise BadThings™ will happen.
typedef ... spinlock_t;

void lock(spinlock_t *l);
void unlock(spinlock_t *l);
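As a quick illustration of the interface (a sketch; the name balance_lock is introduced here), wrapping the bank account's read-modify-write sequences in a lock eliminates the earlier race:

spinlock_t balance_lock;     // hypothetical lock guarding balance

void deposit(long long int amt) {
    lock(&balance_lock);
    balance += amt;
    unlock(&balance_lock);
}

bool withdraw(long long int amt) {
    bool ok = false;
    lock(&balance_lock);
    if (amt <= balance) {
        balance -= amt;
        ok = true;
    }
    unlock(&balance_lock);
    return ok;
}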
Now that we have a mechanism for marking critical sections, we can make our code safer to parallelize. Unfortunately, the tools for this task are not sophisticated enough to perform the work for us, so we must look at the code, identify possible race conditions, and mark critical sections by hand.
Much like Goldilocks, we want to find a critical section of just the right size. If it is too large, we violate the rule of bounded wait, and other threads end up waiting too long. If it is too small, it will not encompass the race condition, and it is as if no action was taken to prevent it.
To determine the right size of a critical section, we need to take into account shared reads and writes of data. If every thread only reads a particular piece of data, we need not worry about it spontaneously changing. Writing data introduces race conditions when there are gaps between when it is written and when it is read.
Let's update our code using our spinlock_t type, keeping in mind the size of the critical sections.
spinlock_t l;    // one global lock guarding every pipe

void writec(struct pipe *p, char c) {
    for (;;) {
        lock(&l);
        if (!pipe_full(p))
            break;
        unlock(&l);
    }
    p->buf[p->w++ % N] = c;
    unlock(&l);
}

char readc(struct pipe *p) {
    for (;;) {
        lock(&l);
        if (!pipe_empty(p))
            break;
        unlock(&l);
    }
    char c = p->buf[p->r++ % N];
    unlock(&l);
    return c;
}

bool pipe_empty(struct pipe *p) {
    return p->w == p->r;
}

bool pipe_full(struct pipe *p) {
    return p->w - p->r == N;
}
Say our application now has 1000 pipes and 5000 threads accessing them. Since we have one global lock for reading and writing any pipe, it becomes a bottleneck: one thread runs while the other 4999 wait for it to finish. If we use finer-grained locks rather than one coarse one, we will have more locks to manage, but we gain the benefit of locking less data at a time.
The pipe data structure can be rewritten to hold a lock pertaining only to that pipe.
Note: in C the -> operator binds tighter than the & operator, so &p->l means &(p->l).
struct pipe {
    char buf[N];
    unsigned r, w;
    spinlock_t l;    // per-pipe lock
};

void writec(struct pipe *p, char c) {
    for (;;) {
        lock(&p->l);
        if (!pipe_full(p))
            break;
        unlock(&p->l);
    }
    p->buf[p->w++ % N] = c;
    unlock(&p->l);
}

char readc(struct pipe *p) {
    for (;;) {
        lock(&p->l);
        if (!pipe_empty(p))
            break;
        unlock(&p->l);
    }
    char c = p->buf[p->r++ % N];
    unlock(&p->l);
    return c;
}

bool pipe_empty(struct pipe *p) {
    return p->w == p->r;
}

bool pipe_full(struct pipe *p) {
    return p->w - p->r == N;
}
We still have a problem with our implementation: we waste lots of CPU cycles spinning while waiting for resources to be freed. We can address this by yielding to the kernel (via a system call or a yield() primitive). Using this technique, however, a problem arises if a single thread is performing a large task: it gets slowed down by all the other threads repeatedly waking up, seeing that the resource is still locked, and going back to sleep.
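As a sketch of the yielding variant, here is writec again, with POSIX's sched_yield() standing in for the yield() primitive mentioned above:

#include <sched.h>    // for sched_yield()

void writec(struct pipe *p, char c) {
    for (;;) {
        lock(&p->l);
        if (!pipe_full(p))
            break;
        unlock(&p->l);
        sched_yield();   // give up the CPU instead of spinning hot
    }
    p->buf[p->w++ % N] = c;
    unlock(&p->l);
}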
We can approach this by making our spin locks blocking locks instead; we want to use a blocking mutex. Our new locking interface takes the form:
// The rules for locking are the same as before, except
// now threads are blocked instead of continuously spinning.
// Rules of engagement:
// 1. It is the caller's responsibility to perform
//    a small task after taking a lock, and to promptly unlock after.
// 2. Do not lock something that's already been locked.
// 3. Do not unlock a resource that hasn't been locked.
// Otherwise BadThings™ will happen.
typedef ... bmutex_t;

// Get the lock, but wait (block) if it is already locked.
void acquire(bmutex_t *b);
void release(bmutex_t *b);
// Grab the lock and return true if it is free;
// otherwise return false immediately without blocking.
bool try_acquire(bmutex_t *b);
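For example (a sketch, assuming the pipe now embeds a bmutex_t b in place of its spinlock), try_acquire lets a thread do useful work instead of blocking when the pipe is busy:

struct pipe {
    char buf[N];
    unsigned r, w;
    bmutex_t b;      // hypothetical: replaces the per-pipe spinlock
};

// Attempt one write; returns true on success, false if the pipe
// was locked or full, letting the caller do other work and retry.
bool try_writec(struct pipe *p, char c) {
    bool written = false;
    if (try_acquire(&p->b)) {
        if (!pipe_full(p)) {
            p->buf[p->w++ % N] = c;
            written = true;
        }
        release(&p->b);
    }
    return written;
}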
Typically, spin locks are faster than blocking mutexes, but they are less efficient under contention, since they burn CPU cycles while waiting. The general strategy is to use spin locks when resources are usually free and rarely contended, and blocking mutexes in the converse situation.
How do we actually implement the primitives for spin locks and mutexes? That is a lesson for another time...