Can we build locks from load and store?
That is to say, are load and store atomic?
One might think that both are always atomic. However, when a load and a store
are executed at the same time on the same object, things get a little tricky;
the load may return complete garbage.
So we'd like to know, under what conditions can we guarantee that a given
load or store will be atomic? This brings us to our
first game of the evening, "Is it Atomic?"
One might find it logical to say something along the following lines:
If the operation is really small, it should be atomic, right?
And yeah, sure, that seems like a good idea at first, but it's actually
wrong. Consider the following struct:
struct A {
  char b;
  char c;
} s;
What would happen if we were to store to both s.b
and s.c at the same time? Because both live in the same machine
word, the processor must perform more than one operation to isolate the
lower/higher byte from the rest of the word, making each store ultimately
not atomic.
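To make this concrete, here is a minimal C sketch (not from the lecture; the
function name and word layout are just illustrative) of the read-modify-write
the hardware effectively performs for a byte store:
#include <stdint.h>

/* Sketch of what a byte store amounts to on hardware that can only read and
 * write whole aligned words. "word" stands for the memory word holding both
 * s.b and s.c. If another thread stores to s.c between the load and the
 * store below, that update is silently lost. */
void store_low_byte(uint32_t *word, uint8_t value) {
  uint32_t tmp = *word;            /* 1. load the whole containing word    */
  tmp = (tmp & ~0xFFu) | value;    /* 2. splice in just the byte we want   */
  *word = tmp;                     /* 3. write the whole word back         */
}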
Alright, so we know we can't guarantee that small loads and stores will be
atomic; what about really long stuff?
As you might have guessed, these are not atomic either. The processor
may have to perform the operation using two or more instructions, because the
value is larger than the processor's word size.
load lo_word    ; first half of the value
load hi_word    ; second half, a separate instruction
Because it must use multiple instructions, such a load may not be atomic.
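For instance, here is a hedged C sketch (the function name is illustrative and
a little-endian layout is assumed) of why a 64-bit load on a 32-bit machine is
not atomic:
#include <stdint.h>
#include <string.h>

/* Sketch: a 64-bit load on a 32-bit machine is really two word loads, so a
 * concurrent 64-bit store can leave us with half of the old value and half
 * of the new one, i.e. "complete garbage". Assumes little-endian layout. */
uint64_t load64(const uint64_t *p) {
  uint32_t lo, hi;
  memcpy(&lo, (const char *)p, sizeof lo);      /* first word load          */
  memcpy(&hi, (const char *)p + 4, sizeof hi);  /* second, separate load    */
  return ((uint64_t)hi << 32) | lo;
}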
So, surely now we can say that "regular"-sized things must be atomic, right?
Again, wrong: if a word is not correctly aligned in memory, we cannot make
this guarantee. This is because the processor must again load two different
words, then manipulate each of them to extract the pieces that belong to the
misaligned value.
load hi_word            ; word containing the upper part of the value
load lo_word            ; word containing the lower part
shr lo_word, 12         ; shift and mask each word to extract just the
and lo_word, 0xFF       ;   bits that belong to our value (the shift and
shl hi_word, 4          ;   mask amounts here are purely illustrative)
and hi_word, 0xFFFFFF00
or hi_word, lo_word     ; splice the two pieces back together
So, naturally, we can guarantee that aligned, word-sized loads
and stores work atomically. However, even if they didn't,
we could still implement locks. Luckily, we don't have to worry about
that in this class, and can hold off on that discussion until a later time.
Even though aligned words are load/store atomic, the
processor still provides extra help for locking algorithms, because they can
be tricky to get right even with these guarantees.
For example, there's the lock incl x instruction
(assuming x is an aligned 32-bit int).
It's basically an atomic x++ instruction. It announces its desire to lock on
the bus, and all other processors get out of the way. Then it increments the
memory, and shouts out "I'm done!"
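As a sketch of how you might actually reach this instruction from C, assuming
x86 and a GCC-style compiler (this is not code from lecture):
/* Sketch: atomic x++ using the lock prefix. Assumes x86 and GCC-style
 * inline assembly; x must be an aligned 32-bit int. */
static inline void atomic_increment(int *x) {
  asm volatile ("lock incl %0"
                : "+m"(*x)      /* read-modify-write the memory operand */
                :               /* no inputs */
                : "memory");    /* don't reorder memory accesses around it */
}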
However, the uses for this are limited. If we want to atomically store an arbitrary value, incrementing by one over and over until we reach that value would be painfully slow, so lock incl alone isn't particularly useful.
Instead, we want a subroutine where we can lock entire blocks of code.
However, Intel does not provide us with such a construct, so we'll have to
make do and build it ourselves.
There is the compare-and-swap instruction. In C code, this is:
bool compare_and_swap(int *addr, int old, int new) {
// It will atomically replace the
// contents of addr from old to new.
if (*addr == old) {
*addr = new;
return true;
}
return false;
}
It works, again, by shouting on the bus. However, it is nicer, because it
allows us to implement locks (and other atomic updates) more efficiently.
void add1(int *a) {
  for (;;) {
    int old = *a;
    if (compare_and_swap(a, old, old + 1))
      return;
  }
}
void atomic_apply(int *a, int (*f)(int)) {
  for (;;) {
    int old = *a;
    if (compare_and_swap(a, old, f(old)))
      return;
  }
}
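As a sketch of the "more efficient locks" claim, here is one way a spinlock
could be built on compare_and_swap, assuming a lock is just a word with
0 = free and 1 = held (the typedef and function names are ours, not the
lecture's):
typedef int lock_t;   /* assumption: a lock is one word, 0 = free, 1 = held */

void cas_lock(lock_t *l) {
  /* Spin until we are the one who flips the lock from 0 to 1. */
  while (!compare_and_swap(l, 0, 1))
    continue;
}

void cas_unlock(lock_t *l) {
  *l = 0;   /* an aligned word store is atomic, so this releases the lock */
}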
Is this enough to do everything we need? Let's look back at our
implementation of pipes from last lecture to see if it works for our purposes.
Lamport's Bakery Algorithm is an algorithm designed by Leslie Lamport
to implement mutual exclusion between threads without any special instructions.
The original research paper can be found here.
Wow, wouldn't it be nice if there were some kind of instruction
that set a new value and handed back the old one, all in a single atomic step?
Well, there is one:
int testNset(int *p, int new) {
  int old = *p;
  *p = new;
  return old;
}
Naturally, it is atomic.
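For what it's worth, x86 gives us this more or less directly: an exchange with
a memory operand is implicitly locked. A sketch, assuming x86 and GCC-style
inline assembly (not code from lecture):
/* Sketch: test-and-set via x86's xchg instruction, which is automatically
 * atomic when one operand is in memory. */
int testNset_x86(int *p, int new) {
  asm volatile ("xchgl %0, %1"
                : "+r"(new), "+m"(*p)   /* swap register and memory atomically */
                :
                : "memory");
  return new;   /* after the swap, "new" holds the old value of *p */
}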
Can we implement lock with testNset?
Yes! We simply testNset the lock to 1; if the old value was already 1, someone
else holds the lock, so we loop until it isn't.
void lock(lock_t *p) {      // lock_t here is just a word: 0 = free, 1 = held
  while (testNset(p, 1))    // old value was 1: someone else holds the lock
    continue;
}
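Unlocking needs no special instruction; a sketch, relying on the guarantee
that aligned word-sized stores are atomic:
void unlock(lock_t *p) {
  *p = 0;   /* an aligned word store is atomic, so this releases the lock */
}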
The problem with our pipe code from last lecture is that if two reads occur at
the same time, both can get the same byte, because one will have read the byte
before the other had a chance to increment p->r.
How can we fix this? Can we use compare-and-swap? Only if we need to update a
single word. But what if we need to update two words? For the pipe, we need
both w and r to decide whether it is empty or full.
One way to solve the problem would be to make r and w both shorts, so that
together they fit into a single word. However, we don't want to limit
structures of arbitrary size to a single word.
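A sketch of that packing idea (the layout and names here are hypothetical,
assuming a 32-bit int and the compare_and_swap from above): with 16-bit
indices, r and w fit in one word, which a single compare_and_swap can update.
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical layout: pack both pipe indices into one 32-bit word. */
union pipe_idx {
  struct { uint16_t r, w; } s;   /* read and write indices                 */
  int word;                      /* one-word view for compare_and_swap     */
};

/* Advance w by one, but only if the pipe (with N slots) is not full. */
bool try_advance_w(union pipe_idx *p, uint16_t N) {
  union pipe_idx old = *p, new = old;
  if ((uint16_t)(old.s.w - old.s.r) == N)
    return false;                                   /* pipe is full        */
  new.s.w = old.s.w + 1;
  return compare_and_swap(&p->word, old.word, new.word);
}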
We can instead use a lock function, as discussed last time.
lock(&l);
unlock(&l);
We simply add a global lock l somewhere in the kernel and change
our code to be:
void write(struct pipe *p, char c) {
  lock(&l);
  while (p->w - p->r == N) {            // pipe is full; see note below
    unlock(&l); yield(); lock(&l);
  }
  p->buf[p->w++ % N] = c;
  unlock(&l);
}
char readc(struct pipe *p) {
  lock(&l);
  while (p->w == p->r) {                // pipe is empty
    unlock(&l); yield(); lock(&l);
  }
  // This "unlock/lock" behavior is necessary so that others can get hold
  // of the lock and make progress. Otherwise, we'd just loop forever.
  char r = p->buf[p->r++ % N];
  unlock(&l);
  return r;
}
Now we shouldn't have any problems with race conditions.
So, for our pipe functions, you may have been wondering, "Why use a global
lock instead of a per-structure lock?" The answer is a complicated one.
A lock protects a certain amount of data. For the pipe, we locked all the pipe
structures with a single lock. Encompassing a lot of data with a single lock
is called coarse-grained locking. Coarse-grained locking can also refer to
a lock that controls a small amount of data, but for a very long period of
time.
Conversely, fine-grained locks cover less data, or cover it for less time. If
we wanted to be more fine-grained here, we could, for example, give each pipe
its own lock, giving us finer control over each pipe (a sketch follows).
Finer-grained locks tend to encourage more parallelism by removing
bottlenecks. However, it's also a lot more complicated to manage a ton of
locks, so excessively fine-grained locking is to be avoided.
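Here is a minimal sketch of the per-pipe approach, reusing lock, unlock, and
yield from above; the value of N and the exact field layout are assumptions:
#define N 1024   /* pipe buffer size, an assumed value */

struct pipe {
  lock_t l;        /* per-pipe lock: protects only this pipe's fields */
  unsigned r, w;   /* read and write indices */
  char buf[N];
};

void writec(struct pipe *p, char c) {
  lock(&p->l);         /* we only contend with users of this particular pipe */
  while (p->w - p->r == N) {
    unlock(&p->l);     /* let a reader of this pipe make room */
    yield();
    lock(&p->l);
  }
  p->buf[p->w++ % N] = c;
  unlock(&p->l);
}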
With the pipe, readc and write are small functions, so their critical sections
(the entire body of each) aren't so bad.
However, what if we have to do some huge operation, such as auditing an
entire bank?
It's a critical section, but while it's running, nobody can go
to the bank. We can't just lock specific accounts, because we need to analyze
relations between different bank accounts.
However, by taking a snapshot of the bank, we can just audit the
snapshot! Nothing can change the snapshot, so we can continue on our
merry way auditing.
The snapshot-taking process has to be atomic, naturally,
but it's a lot faster than auditing the live bank.
In fact, we can make taking the snapshot really fast by having each operation
copy data into the snapshot as it reads or writes.
Dividing up a large atom into smaller atoms in this way is called "atomic
fission."
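A sketch of that idea, with hypothetical names: the first time an account is
modified after the snapshot begins, we stash its old balance, so the audit can
read a consistent picture without stopping the bank.
#include <stdbool.h>

/* Hypothetical account structure for the copy-to-snapshot-on-write idea. */
struct account {
  long balance;        /* live balance, updated by normal operations      */
  long snap_balance;   /* balance frozen as of the snapshot               */
  bool snapped;        /* has snap_balance been filled in yet?            */
};

/* Caller is assumed to hold whatever (small) lock covers this account. */
void deposit(struct account *a, long amount) {
  if (!a->snapped) {
    a->snap_balance = a->balance;   /* preserve the pre-snapshot value    */
    a->snapped = true;
  }
  a->balance += amount;             /* the audit reads snap_balance       */
}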
If we make our critical sections too large, then we'll have too many
bottlenecks. However, if we make our critical sections too small, we can
introduce race conditions by not encapsulating the important sections of
the code.
Somewhere in between lies the area that is "Just Right." We want minimal
critical sections: as small as possible while still covering everything that
must be protected.
Here's a heuristic for discovering a minimal critical section. We're looking
for dangerous things in the code: if code only accesses local variables,
everything is perfectly fine; we only have to worry about shared state.
Strictly speaking, both the writes and the reads of shared state can race,
but we start from the writes.
1. Look for writes to shared state.
2. For each write, find all dependent reads, i.e. the reads that determine
   what to write.
3. Lock the smallest possible segment containing all of said reads and writes.
There are two properties of locks that must be satisfied for them to be of any
real use to us. These are:
Mutual Exclusion - I have exclusive access to the object if I am in the
critical section for that object.
Bounded Waits - I should never starve while waiting for an object. That is,
every thread waiting for a lock eventually acquires it.
We've talked a lot about mutual exclusion, but bounded waits are trickier.
Let's come up with a better way of doing locking, one that lets us ensure, or
at least better approximate, bounded waits. Currently, we burn CPU cycles
while waiting for a lock to be free, preventing other threads from using the
CPU. This means that the fastest processor will always win the race for the
lock!
If we instead yield the CPU while waiting for the lock, we'll burn fewer
cycles, but we'll still have a lot of lock contention. Let's come up with a
new type of locking that does better.
Deadlock was briefly mentioned in lecture, but since time was short in class,
it would probably be most educational if you read up on it from another
source.
Thus far, we've essentially implemented a normal
mutex.
However, to solve the problem of bounded waits, we instead want a mutex that
will block and then queue any threads trying to lock the mutex. In this way,
we don't have to worry about a thread being starved, because each thread will
eventually get a turn to try the lock.
The struct for a blocking mutex looks like this:
typedef struct {
  lock_t l;              // Spinlock protecting this structure
  bool locked;           // Is the blocking mutex currently held?
  proc_t *blocked_list;  // A list of processes waiting for this lock.
} bmutex_t;
We also have a function acquire(), which is to blocking mutexes as lock is to
lock_t (regular mutexes).
void acquire(bmutex_t *b) {
  for (;;) {
    lock(&b->l);
    // Insert here: add self to b->blocked_list
    if (!b->locked)
      break;
    unlock(&b->l);
    yield();
  }
  // (We should also remove ourselves from b->blocked_list here.)
  b->locked = true;
  unlock(&b->l);
}
void release(bmutex_t *b) {
  lock(&b->l);
  schedule(/* one process in b->blocked_list */);  // Wake up one waiting process.
  b->locked = false;
  unlock(&b->l);
  yield();  // Give the newly woken process a chance to run.
}
Last Edited on: May 4th, 2010 by Steven Weiss
Some source code property of Mary Chau and Tarry Chen, used with permission. Original source may be found here.
CS 111, Lec 9: Scribe Notes by Steven Weiss is licensed under a Creative Commons Attribution 3.0 Unported License.