Lecture 9 Scribe Notes

Author: Robert Winders
Lecture Date: 4/28/14

Critical Sections

A critical section is a portion of code that must be executed atomically - that is, from beginning to end without being interrupted.

Example:
acct1 += amount;
acct2 -= amount;

If an interrupt arrives between the two lines of code, or two threads execute this code at the same time, the values of acct1 and acct2 may not be consistent with what we expect them to be. We need to ask the hardware guys to help solve this problem.

 [CPU]    [CPU]
[cache]  [cache]
    |        |
------------------ BUS
         |
       [RAM]

If the CPU enters a critical section, we'll send out a signal to the other CPUs over the bus to ensure the atomicity of the critical section.

x86 Instructions:
lock incl x
-atomic increment of x by 1
lock xchgl x, %eax
-atomic exchange of x and %eax

Biggest problem: these instructions are SLOW. They lock the bus, so we can't make use of the cache to improve performance.
Therefore, we want to use instructions like these sparingly.

C Implementation of xchgl (this shows the effect of the instruction; the hardware performs these steps atomically, while this plain C version by itself is NOT atomic):
int exchange(int volatile *p, int newval){
    int oldval = *p;   /* load the old value... */
    *p = newval;       /* ...and store the new one */
    return oldval;
}
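
The C function above only specifies the behavior; for real atomicity we need the hardware instruction itself. Below is a minimal sketch of a truly atomic exchange, assuming x86 and GCC's extended inline assembly (on x86, xchgl with a memory operand asserts the lock signal implicitly):

static inline int atomic_exchange(int volatile *p, int newval){
    /* after xchgl, the register that held newval holds the old *p */
    asm volatile("xchgl %0, %1"
                 : "+r"(newval), "+m"(*p)
                 : /* no inputs */
                 : "memory");
    return newval;
}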

Locking

Our first implementation of locking is a simple spin lock. If the lock is available, acquire the lock, execute the critical section, and unlock the lock. Otherwise, we'll spin in the lock function until we can acquire the lock.

typedef int volatile lock_t;
void lock(lock_t *p){
    /* exchange returns the previous value of *p: keep spinning
       until it was 0, i.e., the lock was free and we just took it */
    while(exchange(p,1))
        continue;
}

void unlock(lock_t *p){
    *p = 0;   /* mark the lock as free */
}

Precondition for unlock: You must own the lock before you unlock it.
Precondition for lock: You must not own the lock before you lock it.
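
As a usage sketch, the bank account example from the beginning could be protected like this (acct_lock and transfer are illustrative names, not from the lecture):

lock_t acct_lock;   /* guards both accounts */

void transfer(long amount){
    lock(&acct_lock);     /* enter the critical section */
    acct1 += amount;
    acct2 -= amount;
    unlock(&acct_lock);   /* leave the critical section */
}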

There are two basic properties of critical section code that we need to satisfy.

  1. Mutual exclusion - This is satisfied by our spin lock. If I own the lock, no other process or thread can execute the critical section code at the same time I do.
  2. Bounded wait - No thread should have to wait an unbounded amount of time to enter the critical section. To help ensure this, we want the critical section code to be as simple and short as possible.

The first problem we notice with our spin lock: one CPU can cause the other N-1 CPUs to busy wait if they all need the lock. Our spin lock becomes a huge performance bottleneck.

One simple way to avoid this: Make the critical section code so simple that it's atomic on its own already, so we don't need the spin lock.
What can we do atomically?

int x; //global
Thread A: int i = x;
Thread B: x = 27;
Thread C: x = 192736;

If Thread B and C execute before A, we know that i will either equal 27 or 192736; there are no other possibilities.
However, we run into problems if our variable is either too small or too large.

Too small:
Consider the following struct and code segment:
struct {
    unsigned int a: 7; //7 bits
    unsigned int b: 1; //1 bit
} x;
Thread A: int i = x.b;
Thread B: x.b = 1;
Thread C: x.a = 97;
Even though we set x.b to 1 in Thread B, the hardware can't load or store individual bits: assigning to x.a means loading the whole word, changing seven bits, and storing the whole word back. Thread C's store can therefore overwrite Thread B's store to x.b, so we can't be sure that i will equal 1.

Too large:
If we have long long x on a 32-bit machine, x is now a 64-bit variable, so each load and store takes two instructions, one per 32-bit half. A reader can observe one half of an old value and one half of a new value, so we are not guaranteed a value consistent with what we expect.
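
Here is a sketch of how such a torn read could show up, assuming a 32-bit build (e.g., gcc -m32 -pthread) where a 64-bit store compiles to two 32-bit stores; all names are illustrative:

#include <pthread.h>
#include <stdio.h>

static long long x;   /* shared and deliberately unsynchronized */

static void *writer(void *arg){
    (void)arg;
    for(;;){      /* flip x between two values whose halves differ */
        x = 0;
        x = -1;   /* all bits set */
    }
    return NULL;
}

int main(void){
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    for(;;){
        long long v = x;         /* may load the two halves separately */
        if(v != 0 && v != -1){   /* half old, half new: a torn read */
            printf("torn read: %llx\n", (unsigned long long)v);
            return 0;
        }
    }
}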

Locking and Pipes

A pipe acts as a way to communicate between different processes. One process writes data to a buffer which another one can read from.

#define N 1024
lock_t l;   /* one global lock shared by every pipe */
struct pipe{
    char buf[N];
    size_t r,w;   /* cumulative counts of bytes read and written */
};

Visual Representation:
0              buf                1024
+-----------------------------------+
|///////////|   DATA   |////////////|
+-----------------------------------+
            ^          ^
            |          |
            r          w
Read and Write implementation for our pipe:

char read_pipe(struct pipe *p){
    lock(&l);
    while(p->r == p->w){   /* pipe is empty: nothing to read */
        unlock(&l);
        yield();           /* let someone else run, then recheck */
        lock(&l);
    }
    char c = p->buf[p->r++%N];
    unlock(&l);
    return c;
}

void write_pipe(struct pipe *p, char c){
    lock(&l);
    while(p->w - p->r == N){   /* pipe is full: no room to write */
        unlock(&l);
        yield();
        lock(&l);
    }
    p->buf[p->w++%N] = c;
    unlock(&l);
}

Important Things to Notice:

We notice it would be useful to allow two threads to read or write two different pipes at the same time. Our code is too conservative in its current state: the single global lock l is coarse grained, and every pipe operation contends for it. We would like a finer grained lock here.

A potential solution: add a lock_t member to our pipe struct and, in read_pipe and write_pipe, turn every lock(&l) and unlock(&l) into lock(&p->l) and unlock(&p->l).
This is a much finer grained lock, but our approach still doesn't scale well if we have, for example, tons of writers and few readers: the waiting threads burn CPU time in yield() loops. We need something better than yield() to better utilize CPU time.
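
A sketch of the per-pipe lock just described:

struct pipe{
    lock_t l;   /* per-pipe lock: different pipes no longer contend */
    char buf[N];
    size_t r,w;
};
/* and in read_pipe/write_pipe, lock(&l) becomes lock(&p->l), etc. */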

Blocking Mutex

A blocking mutex is different from a spin lock: instead of looping until you acquire the lock, the OS puts you to "sleep" when the lock is unavailable and wakes you up when there's a good chance you can acquire it. In the meantime, another process/thread can run while you block.

We'll use the following implementation for our blocking mutex with acquire and release functions:

typedef struct {
    lock_t l;                 /* spin lock protecting this structure */
    bool locked;              /* is the mutex currently held? */
    thread_t *blocked_list;   /* threads waiting to acquire it */
} bmutex_t;

void acquire(bmutex_t *b){
    lock(&b->l);
    while(b->locked){
        /* push ourselves (self = the current thread's descriptor)
           onto the blocked list and go to sleep */
        thread_t *l = b->blocked_list;
        b->blocked_list = self;
        self->runnable = false;
        self->next = l;
        unlock(&b->l);
        yield();
        lock(&b->l);
    }
    b->locked = true;
    unlock(&b->l);
}

void release(bmutex_t *b){
    lock(&b->l);
    b->locked = false;
    thread_t *next = b->blocked_list;
    if(next){                 /* wake one waiter, if any */
        next->runnable = true;
        b->blocked_list = next->next;
    }
    unlock(&b->l);
}

A more general blocking mutex is called a semaphore: a blocking mutex with arity greater than or equal to one. For example, we may want a semaphore that restricts the number of people logged in to a website to 10. A blocking mutex is also known as a binary semaphore.
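
Here is a sketch of a counting semaphore built in the same style as our blocking mutex; the names semaphore_t, P, and V are assumptions, not from the lecture:

typedef struct {
    lock_t l;                 /* spin lock protecting this structure */
    int count;                /* arity: how many holders are allowed at once */
    thread_t *blocked_list;   /* threads waiting on the semaphore */
} semaphore_t;

void P(semaphore_t *s){       /* acquire: wait until a slot is free */
    lock(&s->l);
    while(s->count == 0){     /* block, exactly as in acquire() */
        self->next = s->blocked_list;
        s->blocked_list = self;
        self->runnable = false;
        unlock(&s->l);
        yield();
        lock(&s->l);
    }
    s->count--;               /* take a slot */
    unlock(&s->l);
}

void V(semaphore_t *s){       /* release: free a slot and wake one waiter */
    lock(&s->l);
    s->count++;
    thread_t *next = s->blocked_list;
    if(next){
        next->runnable = true;
        s->blocked_list = next->next;
    }
    unlock(&s->l);
}

Initializing count to 10 gives the website example above; count == 1 recovers the blocking mutex, which is why it's called a binary semaphore.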

With our blocking mutex, we can change our pipe to use one in place of the bare spin lock. This further reduces spinning in our read and write functions.

Performance issue: The OS can't tell the difference between readers and writers. We may encounter performance problems if the numbers of readers and writers are very lopsided.

To help remedy this, we'll introduce condition variables into our pipe struct. A condition variable represents a condition the application is interested in - in our case, "Is the pipe nonempty?" or "Is the pipe nonfull?". We implement this by adding the following lines of code to our pipe struct:
condvar_t nonempty;
condvar_t nonfull;

Now, at the end of our write_pipe function we can add the line notify(&p->nonempty), and at the end of our read_pipe function we can add the line notify(&p->nonfull). This will help increase performance by letting the OS know that readers have work to do when the pipe is nonempty and writers have work to do when the pipe is nonfull.
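
A sketch of read_pipe rewritten with these condition variables, assuming the pipe struct now contains a blocking mutex b, and wait() is a primitive that atomically releases the mutex, sleeps until notified, and reacquires the mutex before returning:

char read_pipe(struct pipe *p){
    acquire(&p->b);
    while(p->r == p->w)
        wait(&p->nonempty, &p->b);   /* sleep until a writer notifies */
    char c = p->buf[p->r++%N];
    release(&p->b);
    notify(&p->nonfull);             /* we freed a slot: wake a writer */
    return c;
}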

Deadlock

Deadlock occurs when processes/threads are waiting on each other in order to progress. If we have two threads A and B, for example, and A is waiting on B, but B is also waiting on A, then no work gets done and we have deadlock.

Deadlock Example:
Imagine we have a new function called splice which takes two file descriptors. When you read from one fd, it will automatically write to the other one. This seems to simplify our use of pipes. What's the problem though?

Let's say we have two threads, both of which need to lock the file descriptors fd0 and fd1 in order to call splice. Thread A calls splice(fd0, fd1) and thread B calls splice(fd1, fd0). If both threads execute at the same time and lock the file descriptors in argument order, we may reach a state where A holds the lock on fd0 and B holds the lock on fd1. Each will then try to lock the file descriptor it doesn't have and wait forever, since the other thread holds it. This is a simple example of deadlock.
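
A sketch of the two acquisition orders, where lock_fd() and unlock_fd() are assumed helpers that lock and unlock a per-descriptor lock:

void thread_A(void){
    lock_fd(fd0);    /* A holds fd0...                   */
    lock_fd(fd1);    /* ...and waits here if B holds fd1 */
    splice(fd0, fd1);
    unlock_fd(fd1);
    unlock_fd(fd0);
}

void thread_B(void){
    lock_fd(fd1);    /* B holds fd1...                   */
    lock_fd(fd0);    /* ...and waits here if A holds fd0 */
    splice(fd1, fd0);
    unlock_fd(fd0);
    unlock_fd(fd1);
}

One standard remedy is to impose a global lock order - for example, always lock the lower-numbered descriptor first - so that a circular wait can never form.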