Lecture 9: Critical Sections, Synchronization Primitives and Deadlock

CS 111, Winter 2016

By Elise Yuen, Stacy Miller and Eric Chan



Theory on Policy - Goldilocks Principle

How big should a critical section be? The Goldilocks Principle says it should be neither too small nor too large, but just right. When a critical section is too small, we get races; when it is too large, we lose parallelism.


Bank Account Example

Let's look at this from the perspective of a bank transfer. Suppose we want to transfer money from one bank account to another using the following function:

bool transfer(bankAct fromacct, bankAct toacct, int amt);

This function will perform as expected in a single threaded system, but what happens in a multithreaded system? When actions are performed in parallel, we can no longer guarantee the consistency of our results due to potential race conditions. Consider the following scenario:

Let bankAct A have $50 and B have $100. We can transfer $30 from A to B with

transfer(A, B, 30);

We would expect this to result in $20 in A and $130 in B. However, suppose that a second concurrent thread is running the following:

transfer(A, B, 30);

A logical result would be for the first transfer to succeed and for the second to fail due to lack of funds. However, a race condition can break this: if both threads check A's balance before either one debits it, both see $50 and both conclude that A has enough funds. A then ends up with either $20 or -$10, depending on how the two stores to A interleave.
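To make the race concrete, here is a hypothetical unsynchronized implementation (a sketch; the balance field and pointer arguments are assumptions, since the lecture only gives the signature):

bool transfer(bankAct *from, bankAct *to, int amt){
    if (from->balance < amt)  // (1) check the funds
        return false;
    from->balance -= amt;     // (2) debit -- but another thread may have
    to->balance += amt;       //     passed check (1) in between, so both
    return true;              //     transfers see $50 and both debit A
}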

Thus, we will want to prevent race conditions by implementing locks in the following manner:

lock(fromacct);
// delicate actions on fromacct
unlock(fromacct);

lock(toacct);
// delicate actions on toacct
unlock(toacct);
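Put together, a transfer following this pattern might look like the sketch below (hypothetical; the pointer arguments and balance field are assumptions):

bool transfer(bankAct *from, bankAct *to, int amt){
    lock(&from->lock);
    bool ok = from->balance >= amt;
    if (ok)
        from->balance -= amt;  // debit while holding from's lock
    unlock(&from->lock);
    if (ok){
        lock(&to->lock);
        to->balance += amt;    // credit while holding to's lock
        unlock(&to->lock);
    }
    return ok;
}

Note that between the two critical sections the money is in neither account; this transient state is part of what makes auditing tricky.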

What issues may arise? Now that we have introduced locking, deadlock becomes possible; the Goldilocks Principle applies again, since holding more locks for longer makes deadlock more likely. Imagine that auditors come to your bank to check the accounting books, returning a boolean value that says whether all the account values line up correctly.

bool audit(all accounts);

How could we audit the system while maintaining consistency and preventing race conditions? One possible solution is simply to grab locks on all accounts and then audit them, as shown below.

bool audit(all accounts){
    ...
    ...
    acquire locks on all accounts;
    // delicate critical section
    release all locks;
    ...
    ...
};
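How do we acquire all those locks without deadlocking against in-flight transfers? One standard technique (an assumption; the lecture does not spell one out) is to impose a single global lock order, say by account number, and always acquire locks in that order:

// accts[] is sorted by account number; nacct is the number of accounts;
// books_balance() is a hypothetical check over all account values
for (int i = 0; i < nacct; i++)
    lock(&accts[i].lock);
bool ok = books_balance();
for (int i = nacct - 1; i >= 0; i--)
    unlock(&accts[i].lock);

If every thread acquires locks in the same global order, the circular wait that deadlock requires can never form.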

However, even with deadlock handled, we clearly have a performance issue: while the audit holds every lock, no transfers can proceed. A practical workaround is to perform the audit, and its locking, at a time when very few people are using the bank's services.

Another possible solution is to assume there exists some primitive that can take a snapshot of the bank's entire state quickly. We would still need to acquire locks on all accounts, but the premise is that we could release them far sooner than with the first method. However, we would then have to design this snapshot primitive, which reduces the problem to an equally hard one.


Practice on Policy with Mechanisms

We have discussed the ideas of how to implement policy, but how do we physically do it?

Suppose we have a uniprocessor. We want to disable interrupts that might affect critical sections. We especially want to make sure to disable the clock interrupt since this is what the kernel uses during preemptive scheduling when deciding when to change control of the CPU from one thread to another. Note that we only want to disable these temporarily because if they are disabled indefinitely the process may hog the processor and introduce starvation. This solution essentially implements cooperative multitasking, as it disables preemptive scheduling with the assumption that the thread will later turn it back on.
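On x86, for instance, the kernel can bracket a critical section with the cli and sti instructions (a sketch; these instructions are privileged, so this only works in kernel mode):

void critical(void){
    asm("cli"); // disable interrupts: no clock interrupt, no preemption
    // ... keep this short! nothing else can run on this CPU ...
    asm("sti"); // re-enable interrupts promptly to avoid starvation
}

This only helps on a uniprocessor, though; threads running on other CPUs are unaffected.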

Now suppose we have a multiprocessor. Let's use a sample program, a pipe implementation of our own, to illustrate the use of and need for locks.

The basic pattern for our critical sections:

lock(&l);
// anyone else who tries to acquire the lock while we hold it must wait
// we will fix correctness bugs by putting the lock inside the pipe struct
unlock(&l);

More on the actual implementation further down! First, let’s declare our pipe struct:

struct pipe{
    char buf[1024];
    unsigned r,w;
    lock_t lock;
};

However, this has a mistake! If r and w were indices into buf, we couldn't tell a full pipe from an empty one: both look like r == w. Solution: make r and w free-running counters that only ever increase, and index the buffer with r % 1024 and w % 1024. Then the pipe is empty exactly when w == r, and full exactly when w - r == 1024.
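With free-running counters the two states are easy to test (a small sketch of the convention; unsigned wraparound keeps w - r correct even after the counters overflow):

bool pipe_empty(struct pipe *p) { return p->w == p->r; }
bool pipe_full (struct pipe *p) { return p->w - p->r == 1024; }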

Now, let’s implement the write pipe function:

void write_pipe(struct pipe *p, char c){
    p->buf[p->w++%1024]=c;
}

Problem: When the pipe is full, the function will still continue to write to the pipe. In order to solve this problem, let’s change the function to return a bool so that it returns true on success, and false on failure. Additionally, add an if statement that checks to see if the pipe is full:

bool write_pipe(struct pipe *p, char c){
    lock(&p->lock);
    if(p->w-p->r==1024){
        unlock(&p->lock);
        return false;
    }
    p->buf[p->w++%1024]=c;
    unlock(&p->lock);
    return true;
}

But who wants to write unlock(&p->lock) twice? Let's condense the code so that we only need to call unlock() once!

bool write_pipe(struct pipe *p, char c){
    char *ptr = 0;
    lock(&p->lock);
    bool full = p->w-p->r==1024;
    if(!full)
        ptr = &p->buf[p->w++%1024];
    unlock(&p->lock);
    if(!full)
        *ptr = c;
    return !full;
}

Unfortunately, this code still has a bug, just a subtler one. The store *ptr = c happens outside the critical section, but w was already incremented inside it: the slot is published before the data is written, so a concurrent reader can consume that slot and read a stale byte before our store lands. The fix is simply to do the store while still holding the lock:

bool write_pipe(struct pipe *p, char c){
    lock(&p->lock);
    bool full = p->w-p->r==1024;
    if(!full)
        p->buf[p->w++%1024] = c;
    unlock(&p->lock);
    return !full;
}

Now, let’s work on implementing read_pipe:

char read_pipe(struct pipe *p){
    return p->buf[p->r++%1024];
}

Problem: What if the pipe is empty? We will have to check whether the read and write counters (r and w) are equal.

int read_pipe(struct pipe *p){
    if (p->w==p->r)
        return CHAR_MIN-1; // sentinel no char can equal, hence the int return type
    return p->buf[p->r++%1024];
}

However, this code still produces race conditions! For example, what if a thread was writing to the particular section at the same time as a read, or if two threads are reading at the same time? Therefore, we must establish a critical section using locks.

int read_pipe(struct pipe *p){
    lock(&p->lock);
    int c = p->r==p->w ? CHAR_MIN-1 : p->buf[p->r++%1024];
    unlock(&p->lock);
    return c;
}

This is a satisfactory implementation of read_pipe(). Now, back to the actual implementation of lock() and unlock() for our critical sections; we can do this in one of two ways.

  1. Syscalls -- Easy to get right, since trapping into the kernel lets the kernel serialize access. However, the overhead of a trap often makes this too slow.
  2. Plain machine instructions

Let’s write some sample implementations of lock() and unlock():

// 0 = unlocked
// 1 = locked
typedef char lock_t; // char is smallest = faster?
// precondition = we don’t have the lock
void lock(lock_t *l){
    while(*l)
        continue; // spin while someone else holds the lock
    // RACE CONDITION HERE!! another thread can acquire the lock
    // between our load of *l above and our store below
    *l = 1;
}
// postcondition = we have the lock

// precondition = we have the lock
void unlock(lock_t *l){
    *l = 0;
}
// postcondition = we don't have the lock

Unfortunately, this implementation doesn't work, because of the race condition in lock(): between the load that sees *l == 0 and the store *l = 1, another thread can acquire the lock, and then both threads think they hold it. There is a pure-software solution to this problem, but usually you just call the hardware guys to fix it.
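The classic software solution is Peterson's algorithm for two threads, sketched below for reference (not covered in lecture; it also assumes sequentially consistent memory, which modern CPUs guarantee only with extra memory barriers):

bool interested[2]; // does thread i want the lock?
int turn;           // which thread must wait when both want it

void lock(int self){
    int other = 1 - self;
    interested[self] = true; // announce that we want the lock
    turn = other;            // politely defer to the other thread
    while (interested[other] && turn == other)
        continue; // spin only while the other thread wants it and has priority
}

void unlock(int self){
    interested[self] = false;
}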


How Does Hardware Access Actually Work?

When caching with cache lines (say, of size 32 bytes), all loads and stores to RAM move 32 bytes at a time. Another CPU's cache might be writing to RAM at the same time, so our lock implementation is not safe. A partial solution is cache coherence, which ensures that changes to shared values propagate in a timely manner: when a cache sees another cache talk to RAM in an area of memory that it also holds, it discards its own copy. However, this slows the program down, and it still does not fix the race in our lock() implementation.

We need some way to acquire the lock atomically. Eggert's solution: get help from the hardware guys again.

Spin locks are locks that make threads wait in a while loop, while repeatedly checking if the lock is available. The following are some notes about them.

  1. Spin locks should not be too large

    typedef __int128 lock_t; // 128 bits!!
    *l = 0; // compiles to two 64-bit movq instructions

    Such a large spin lock (128 bits) causes a single line of code to need two 64-bit instructions (movq). This causes unlock() to no longer be atomic since there can be a race condition between the movq instructions.

  2. Spin locks also should not be too small

    This example uses a bit field, in which the compiler packs the lock into a single bit of a larger word:

    struct pipe {
        unsigned char lock:1; // bit field: one bit of the containing byte
    };

    The hardware cannot load or store a single bit atomically: updating the bit means reading and writing the surrounding bytes, which can race with updates to neighboring fields. On x86 and x86-64, an int works as a lock as long as it is aligned, i.e., its address is a multiple of 4.


Extra Instructions for Atomic Access for 'int'

asm("lock incl x") → At some memory location, grab x value, increment, then store it. This is not enough for general purpose locking because it will blindly add 1 to a lock regardless and still introduces race conditions.

asm("xchgl r,x") → Specify register r and memory x, and exchange contents. This will return NULL if r and x are the same.

Below is an abstract C-code representation of what the xchgl instruction performs atomically:

int xchgl(int *p, int val){
    int old = *p;
    *p = val;
    return old; // actually done atomically, asm
}

void lock(int *l){
    while(xchgl(l,1))
        continue;
}
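For reference, GCC and Clang expose the same operations portably through atomic builtins (an assumption: the lecture sticks to raw x86, but on x86 these compile down to xchgl and a plain store):

void lock(int *l){
    while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE))
        continue; // the old value was 1, so someone else held the lock
}

void unlock(int *l){
    __atomic_store_n(l, 0, __ATOMIC_RELEASE);
}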


Example: the following shows how we might try to use xchgl() to update a shared variable x without a conventional lock:

int old = x;
int new = f(x);
while(xchgl(&x,new)!=old)
    continue;

This implementation tries to exchange the values without locking, but it doesn't work: xchgl stores new unconditionally, so even when the comparison fails we have already clobbered x and possibly wiped out another thread's update. What we want is an instruction that swaps only if x still holds the old value. Let's try a new implementation that instead uses cas(), which stands for "compare and swap".

bool cas(int *p, int old, int new){
    if(*p==old){
        *p = new;
        return true;
    }
    else
        return false;
}

We can now use cas() in a retry loop to apply an update to x atomically:

int old;
do {
    old = x;
} while (!cas(&x, old, f(old)));

Note the !: cas() returns true on success, so we loop only while the swap fails, i.e., while some other thread changed x between our read and our swap.
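Real hardware implements cas() directly (lock cmpxchg on x86), and compilers expose it as a builtin. Here is the same retry loop in that form (a sketch, not from the lecture):

// Atomically replace *x with f(*x), retrying if another thread interferes.
void atomic_apply(int *x, int (*f)(int)){
    int old;
    do {
        old = *x; // snapshot the current value
    } while (!__sync_bool_compare_and_swap(x, old, f(old)));
}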

Problem: with this style of locking, if N threads are accessing some pipe, then N-1 threads will spin!

read_pipe, write_pipe
    lock(&global_lock)
    work
    unlock(&global_lock)

struct pipe{
    data > 1000 bytes
}
lock_t global_lock;

This is called coarse-grained locking, where large parts of code are locked as one big piece.
Pros: Very simple to implement.
Cons: Costly in performance; in particular, the global lock becomes a bottleneck, and holding it for one pipe prevents anyone else from accessing any pipe.

The approach that we implemented before is called fine-grained locking, in which there is a separate lock for each individual piece of data.
Pros: Fewer bottlenecks if you play your cards right.
Cons: More complicated to implement, and costs more memory.

We can alter the pipe implementation as follows:

struct pipe{
    char buf[1024];
    unsigned r,w;
    lock_t rl, wl; // separate locks for reading and writing the pipe
};

We use different locks for reading and writing so that a read and a write to the same pipe can proceed concurrently. Now suppose there are many readers and few writers. The readers constantly try to read from the pipe and constantly find it empty:

while((c=read_pipe(p))==CHAR_MIN-1)
    continue;
// do work

This implementation chews up CPU time, because the while loop spins, re-checking the pipe over and over. Instead, let's use a function called yield() to solve this issue.

while((c=read_pipe(p))==CHAR_MIN-1)
    yield();
// do work

This gives the CPU to a more deserving thread instead of doing useless work: a waiting reader sets a bit to notify the scheduler that it is waiting, so the scheduler can skip over readers that cannot possibly succeed. The remaining overhead is proportional to the number of threads trying to get control of the pipe.
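On POSIX systems, sched_yield() from <sched.h> plays this role (an assumption; the lecture's yield() is abstract):

#include <sched.h>

while ((c = read_pipe(p)) == CHAR_MIN-1)
    sched_yield(); // hand the CPU to another runnable thread
// do work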