Can we build locks from load and store?
That is to say, are load and store atomic?
One might think that both are always atomic. However, when a load and a store
are executed at the same time on the same object, things get a little tricky;
the load may return complete garbage.
So we'd like to know, under what conditions can we guarantee that a given
load or store will be atomic? This brings us to our
first game of the evening, "Is it Atomic?"
One might find it logical to say something along the following lines:
If the operation is really small, it should be atomic, right?
And yeah, sure, that seems like a good idea at first, but it's actually
wrong. Consider the following struct:
struct A {
  char b;
  char c;
} s;
What would happen if we were to store to both s.b
and s.c at the same time? Because both live in the same machine
word, the processor must perform more than one operation to isolate the
lower/higher byte from the rest of the word, making each store ultimately
not atomic.
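To make this concrete, here is a minimal C sketch (not from the lecture; the
function name and word layout are just illustrative) of the read-modify-write
the hardware effectively performs for a byte store:
#include <stdint.h>

/* Sketch of what a byte store amounts to on hardware that can only read and
 * write whole aligned words. "word" stands for the memory word holding both
 * s.b and s.c. If another thread stores to s.c between the load and the
 * store below, that update is silently lost. */
void store_low_byte(uint32_t *word, uint8_t value) {
  uint32_t tmp = *word;            /* 1. load the whole containing word    */
  tmp = (tmp & ~0xFFu) | value;    /* 2. splice in just the byte we want   */
  *word = tmp;                     /* 3. write the whole word back         */
}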
Alright, so we know we can't guarantee that small loads and stores will be
atomic; what about really long stuff?
As you might have guessed, these are not atomic either. The processor
may have to perform the operation using two or more instructions, because the
value is larger than the processor's word size.
load lo_word    ; first half of the value
load hi_word    ; second half, a separate instruction
Because it must use multiple instructions, such a load may not be atomic.
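For instance, here is a hedged C sketch (the function name is illustrative and
a little-endian layout is assumed) of why a 64-bit load on a 32-bit machine is
not atomic:
#include <stdint.h>
#include <string.h>

/* Sketch: a 64-bit load on a 32-bit machine is really two word loads, so a
 * concurrent 64-bit store can leave us with half of the old value and half
 * of the new one, i.e. "complete garbage". Assumes little-endian layout. */
uint64_t load64(const uint64_t *p) {
  uint32_t lo, hi;
  memcpy(&lo, (const char *)p, sizeof lo);      /* first word load          */
  memcpy(&hi, (const char *)p + 4, sizeof hi);  /* second, separate load    */
  return ((uint64_t)hi << 32) | lo;
}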
So, surely now we can say that "regular"-sized things must be atomic, right?
Again, wrong: if a word is not correctly aligned in memory, we cannot make
this guarantee. This is because the processor must again load two different
words, then manipulate each of them to extract the pieces that belong to the
misaligned value.
load hi_word            ; word containing the upper part of the value
load lo_word            ; word containing the lower part
shr lo_word, 12         ; shift and mask each word to extract just the
and lo_word, 0xFF       ;   bits that belong to our value (the shift and
shl hi_word, 4          ;   mask amounts here are purely illustrative)
and hi_word, 0xFFFFFF00
or hi_word, lo_word     ; splice the two pieces back together
So, naturally, we can guarantee that aligned, word-sized loads
and stores work atomically. However, even if they didn't,
we could still implement locks. Luckily, we don't have to worry about
that in this class, and can hold off on that discussion until a later time.
Even though aligned words are load/store atomic, the
processor still provides extra help for locking algorithms, because they can
be tricky to get right even with these guarantees.
For example, there's the lock incl x instruction
(assuming x is an aligned 32-bit int).
It's basically an atomic x++ instruction. It announces its desire to lock on
the bus, and all other processors get out of the way. Then it increments the
memory, and shouts out "I'm done!"
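As a sketch of how you might actually reach this instruction from C, assuming
x86 and a GCC-style compiler (this is not code from lecture):
/* Sketch: atomic x++ using the lock prefix. Assumes x86 and GCC-style
 * inline assembly; x must be an aligned 32-bit int. */
static inline void atomic_increment(int *x) {
  asm volatile ("lock incl %0"
                : "+m"(*x)      /* read-modify-write the memory operand */
                :               /* no inputs */
                : "memory");    /* don't reorder memory accesses around it */
}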
However, the uses for this are limited. If we want to atomically store an arbitrary value, incrementing by one over and over until we reach that value would be painfully slow, so lock incl alone isn't particularly useful.
Instead, we want a subroutine where we can lock entire blocks of code.
However, Intel does not provide us with such a construct, so we'll have to
make do and build it ourselves.
There is the compare-and-swap instruction. In C code, this is:
bool compare_and_swap(int *addr, int old, int new) {
// It will atomically replace the
// contents of addr from old to new.
if (*addr == old) {
*addr = new;
return true;
}
return false;
}
It works, again, by shouting on the bus. However, it is nicer, because it
allows us to implement locks (and other atomic updates) more efficiently.
void add1(int *a) {
  for (;;) {
    int old = *a;
    if (compare_and_swap(a, old, old + 1))
      return;
  }
}
void atomic_apply(int *a, int (*f)(int)) {
  for (;;) {
    int old = *a;
    if (compare_and_swap(a, old, f(old)))
      return;
  }
}
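As a sketch of the "more efficient locks" claim, here is one way a spinlock
could be built on compare_and_swap, assuming a lock is just a word with
0 = free and 1 = held (the typedef and function names are ours, not the
lecture's):
typedef int lock_t;   /* assumption: a lock is one word, 0 = free, 1 = held */

void cas_lock(lock_t *l) {
  /* Spin until we are the one who flips the lock from 0 to 1. */
  while (!compare_and_swap(l, 0, 1))
    continue;
}

void cas_unlock(lock_t *l) {
  *l = 0;   /* an aligned word store is atomic, so this releases the lock */
}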
Is this enough to do everything we need? Let's look back at our
implementation of pipes from last lecture to see if it works for our purposes.
Lamport's Bakery Algorithm is an algorithm designed by Leslie Lamport
to implement mutual exclusion between threads without any special instructions.
The original research paper can be found here.
Wow, wouldn't it be nice if there were some kind of instruction
that set a new value and handed back the old one, all in a single atomic step?
Well, there is one:
int testNset(int *p, int new) {
  int old = *p;
  *p = new;
  return old;
}
Naturally, it is atomic.
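For what it's worth, x86 gives us this more or less directly: an exchange with
a memory operand is implicitly locked. A sketch, assuming x86 and GCC-style
inline assembly (not code from lecture):
/* Sketch: test-and-set via x86's xchg instruction, which is automatically
 * atomic when one operand is in memory. */
int testNset_x86(int *p, int new) {
  asm volatile ("xchgl %0, %1"
                : "+r"(new), "+m"(*p)   /* swap register and memory atomically */
                :
                : "memory");
  return new;   /* after the swap, "new" holds the old value of *p */
}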
Can we implement lock with testNset?
Yes! We simply testNset the lock to 1; if the old value was already 1, someone
else holds the lock, so we loop until it isn't.
void lock(lock_t *p) {      // lock_t here is just a word: 0 = free, 1 = held
  while (testNset(p, 1))    // old value was 1: someone else holds the lock
    continue;
}
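Unlocking needs no special instruction; a sketch, relying on the guarantee
that aligned word-sized stores are atomic:
void unlock(lock_t *p) {
  *p = 0;   /* an aligned word store is atomic, so this releases the lock */
}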
The problem with our pipe code from last lecture is that if two reads occur at
the same time, both can get the same byte, because one will have read the byte
before the other had a chance to increment p->r.
How can we fix this? Can we use compare-and-swap? Only if we need to update a
single word. But what if we need to update two words? For the pipe, we need
both w and r to decide whether it is empty or full.
One way to solve the problem would be to make r and w both shorts, so that
together they fit into a single word. However, we don't want to limit
structures of arbitrary size to a single word.
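A sketch of that packing idea (the layout and names here are hypothetical,
assuming a 32-bit int and the compare_and_swap from above): with 16-bit
indices, r and w fit in one word, which a single compare_and_swap can update.
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical layout: pack both pipe indices into one 32-bit word. */
union pipe_idx {
  struct { uint16_t r, w; } s;   /* read and write indices                 */
  int word;                      /* one-word view for compare_and_swap     */
};

/* Advance w by one, but only if the pipe (with N slots) is not full. */
bool try_advance_w(union pipe_idx *p, uint16_t N) {
  union pipe_idx old = *p, new = old;
  if ((uint16_t)(old.s.w - old.s.r) == N)
    return false;                                   /* pipe is full        */
  new.s.w = old.s.w + 1;
  return compare_and_swap(&p->word, old.word, new.word);
}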
We can instead use a lock function, as discussed last time.
lock(&l);
unlock(&l);
We simply add a global lock l somewhere in the kernel and change
our code to be:
void write(struct pipe *p, char c) {
  lock(&l);
  while (p->w - p->r == N) {            // pipe is full; see note below
    unlock(&l); yield(); lock(&l);
  }
  p->buf[p->w++ % N] = c;
  unlock(&l);
}
char readc(struct pipe *p) {
  lock(&l);
  while (p->w == p->r) {                // pipe is empty
    unlock(&l); yield(); lock(&l);
  }
  // This "unlock/lock" behavior is necessary so that others can get hold
  // of the lock and make progress. Otherwise, we'd just loop forever.
  char r = p->buf[p->r++ % N];
  unlock(&l);
  return r;
}
Now we shouldn't have any problems with race conditions.
So, for our pipe functions, you may have been wondering, "Why use a global
lock instead of a per-structure lock?" The answer is a complicated one.
A lock protects a certain amount of data. For the pipe, we locked all the pipe
structures with a single lock. Encompassing a lot of data with a single lock
is called coarse-grained locking. Coarse-grained locking can also refer to
a lock that controls a small amount of data, but for a very long period of
time.
Conversely, fine-grained locks cover less data, or cover it for less time. If
we wanted to be more fine-grained here, we could, for example, give each pipe
its own lock, giving us finer control over each pipe (a sketch follows).
Finer-grained locks tend to encourage more parallelism by removing
bottlenecks. However, it's also a lot more complicated to manage a ton of
locks, so excessively fine-grained locking is to be avoided.
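Here is a minimal sketch of the per-pipe approach, reusing lock, unlock, and
yield from above; the value of N and the exact field layout are assumptions:
#define N 1024   /* pipe buffer size, an assumed value */

struct pipe {
  lock_t l;        /* per-pipe lock: protects only this pipe's fields */
  unsigned r, w;   /* read and write indices */
  char buf[N];
};

void writec(struct pipe *p, char c) {
  lock(&p->l);         /* we only contend with users of this particular pipe */
  while (p->w - p->r == N) {
    unlock(&p->l);     /* let a reader of this pipe make room */
    yield();
    lock(&p->l);
  }
  p->buf[p->w++ % N] = c;
  unlock(&p->l);
}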
With the pipe, readc and write are small functions, so their critical sections
(the entire body of each) aren't so bad.
However, what if we have to do some huge operation, such as auditing an
entire bank?
It's a critical section, but while it's running, nobody can go
to the bank. We can't just lock specific accounts, because we need to analyze
relations between different bank accounts.
However, by taking a snapshot of the bank, we can just audit the
snapshot! Nothing can change the snapshot, so we can continue on our
merry way auditing.
The snapshot-taking process has to be atomic, naturally,
but it's a lot faster than auditing the live bank.
In fact, we can make taking the snapshot really fast by having each operation
copy data into the snapshot as it reads or writes.
Dividing up a large atom into smaller atoms in this way is called "atomic
fission."
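A sketch of that idea, with hypothetical names: the first time an account is
modified after the snapshot begins, we stash its old balance, so the audit can
read a consistent picture without stopping the bank.
#include <stdbool.h>

/* Hypothetical account structure for the copy-to-snapshot-on-write idea. */
struct account {
  long balance;        /* live balance, updated by normal operations      */
  long snap_balance;   /* balance frozen as of the snapshot               */
  bool snapped;        /* has snap_balance been filled in yet?            */
};

/* Caller is assumed to hold whatever (small) lock covers this account. */
void deposit(struct account *a, long amount) {
  if (!a->snapped) {
    a->snap_balance = a->balance;   /* preserve the pre-snapshot value    */
    a->snapped = true;
  }
  a->balance += amount;             /* the audit reads snap_balance       */
}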
If we make our critical sections too large, then we'll have too many
bottlenecks. However, if we make our critical sections too small, we can
introduce race conditions by not encapsulating the important sections of
the code.
Somewhere in between lies the area that is "Just Right." We want minimal
critical sections: as small as possible while still covering everything that
must be protected.
Here's a heuristic for discovering a minimal critical section. We're looking
for dangerous things in the code: if code only accesses local variables,
everything is perfectly fine; we only have to worry about shared state.
Strictly speaking, both the writes and the reads of shared state can race,
but we start from the writes.
1. Look for writes to shared state.
2. For each write, find all dependent reads, i.e. the reads that determine
   what to write.
3. Lock the smallest possible segment containing all of said reads and writes.
There are two properties of locks that must be satisfied for them to be of any
real use to us. These are:
Mutual Exclusion - I have exclusive access to the object if I am in the
critical section for that object.
Bounded Waits - I should never starve while waiting for an object. That is,
every thread waiting for a lock eventually acquires it.
We've talked a lot about mutual exclusion, but bounded waits are trickier.
Let's come up with a better way of doing locking, one that lets us ensure, or
at least better approximate, bounded waits. Currently, we burn CPU cycles
while waiting for a lock to be free, preventing other threads from using the
CPU. This means that the fastest processor will always win the race for the
lock!
If we instead yield the CPU while waiting for the lock, we'll burn fewer
cycles, but we'll still have a lot of lock contention. Let's come up with a
new type of locking that does better.
Deadlock was briefly mentioned in lecture, but since time was short in class,
it would probably be most educational if you read up on it from another
source.
Thus far, we've essentially implemented a normal
mutex.
However, to solve the problem of bounded waits, we instead want a mutex that
will block and then queue any threads trying to lock the mutex. In this way,
we don't have to worry about a thread being starved, because each thread will
eventually get a turn to try the lock.
The struct for a blocking mutex looks like this:
typedef struct {
  lock_t l;              // Spinlock protecting this structure
  bool locked;           // Is the blocking mutex currently held?
  proc_t *blocked_list;  // A list of processes waiting for this lock.
} bmutex_t;
We also have a function acquire(), which is to blocking mutexes as lock is to
lock_t (regular mutexes).
void acquire(bmutex_t *b) {
  for (;;) {
    lock(&b->l);
    // Insert here: add self to b->blocked_list
    if (!b->locked)
      break;
    unlock(&b->l);
    yield();
  }
  // (We should also remove ourselves from b->blocked_list here.)
  b->locked = true;
  unlock(&b->l);
}
void release(bmutex_t *b) {
  lock(&b->l);
  schedule(/* one process in b->blocked_list */);  // Wake up one waiting process.
  b->locked = false;
  unlock(&b->l);
  yield();  // Give the newly woken process a chance to run.
}
Last Edited on: May 4th, 2010 by Steven Weiss
Some source code property of Mary Chau and Tarry Chen, used with permission. Original source may be found here.
CS 111, Lec 9: Scribe Notes by Steven Weiss is licensed under a Creative Commons Attribution 3.0 Unported License.