Lecture 9 Scribe Notes

CS 111 at UCLA | Winter 2016
Rakshit Garg, Rohan Kapoor, James Vaughan, Ronaq Virdi

Table of Contents

  • Goldilocks
  • Critical Section Implementation
  • Lock Implementation
  • Hardware Answers - Caching
  • Cache Coherence
  • Spin Locks
  • Bit Fields
  • Spin Lock Sizes
  • Compare-and-Swap
  • Locking with Pipes

Goldilocks

In this lecture, we mostly talked about mechanisms for dealing with and avoiding race conditions. This is done with a lock. A lock is a variable that tells the machine whether or not it is allowed to execute a section of code. If one thread locks a section of code, other threads will not be able to enter that section of code until the first thread has unlocked it. These lockable sections of code are called critical sections. There are two important things to keep in mind when dealing with critical sections. (Well, there are a lot more than two, but we'll just focus on these two for now.)

  1. It is important that these critical sections not be too large, or they will slow down your program to an unnecessary degree.
  2. It is important that these critical sections not be too small, or they will not include all of the code that should be locked, and race conditions will slip back in.

This need for balance is known as the Goldilocks Principle.

Consider this example: We have a program that manages bank accounts and has a function transfer(from, to, amount) to make transfers between accounts. Suppose we implement this function in the following way.

void transfer(account* from, account* to, int amount) {
  from->balance -= amount;
  to->balance += amount;
}

Can you see the problem here? If we are transferring money to account A at the same time as another thread, the two unsynchronized updates to A's balance can race and one of them can be lost. A quick solution to this problem would be to wrap the actions on the accounts in locks so that only one thread can be changing an account's balance at a time:

void transfer(account* from, account* to, int amount) {
  lock(from);
    from->balance -= amount;
  unlock(from);
  lock(to);
    to->balance += amount;
  unlock(to);
}

This solution helps a bit, but it doesn't solve all of our problems. There is still a period, between the two locks, where we are not in a totally valid state. This is an example of critical sections that are too small. To solve this, we will wrap both of the actions at the same time:

void transfer(account* from, account* to, int amount) {
  lock(from);
    lock(to);
      from->balance -= amount;
      to->balance += amount;
    unlock(to);
  unlock(from);
}

Alright, now it looks like we're in better shape, but unfortunately we still have some problems. We're holding two locks at the same time. What happens when we've got a lock on account A and we are about to lock account B as well, but another thread has already locked B and is waiting for A? This situation is called deadlock. Growing critical sections to satisfy the Goldilocks principle has the potential to create deadlock like this.
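
One standard remedy (not covered in this lecture, but worth sketching) is to always acquire the two locks in a fixed global order, for example by account address. A minimal sketch, assuming from and to are distinct accounts and reusing the lock and unlock calls from above:

void transfer(account* from, account* to, int amount) {
  account* first = from < to ? from : to;  //always lock the lower address first,
  account* second = from < to ? to : from; //so two threads can never wait on each other in a cycle
  lock(first);
  lock(second);
  from->balance -= amount;
  to->balance += amount;
  unlock(second);
  unlock(first);
}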

Critical Section Implementation

On a single processor

We want our critical section to be small enough so that it doesn't interfere with performance, but big enough so that it doesn't introduce races in our code. On a single processor this isn't difficult, because there aren't any other cores or threads that we are competing with. Using the bank account example, consider what happens when we want to deposit money into our account. This might look something like this:

balance += amt;

But an increment is actually a couple of instructions in machine language (we have to load the balance, increment it, and store it), so we need to make sure that all of these instructions complete and are not interrupted in the middle. On a single CPU machine, we can do it by:

  1. Disabling interrupts temporarily - if we disable the clock interrupt on our machine, the kernel won't give the CPU to some other thread, so we can finish the whole update before any switch happens (see the sketch after this list).
  2. Cooperative multitasking - entering and exiting the critical section costs no extra instructions; we simply ensure that a thread never gives up the CPU in the middle of a critical section.
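
Here is a minimal sketch of option 1, assuming kernel-mode code on x86, where the cli and sti instructions disable and re-enable interrupts. The deposit function is illustrative, not from the lecture:

void deposit(account *a, int amt) {
  asm volatile ("cli"); //disable interrupts: no clock interrupt, so no thread switch
  a->balance += amt;    //the load, increment, and store all complete without interruption
  asm volatile ("sti"); //re-enable interrupts
}
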
On multiple processors

With multiple processors, races between threads become more prevalent since multiple cores are executing code at the same time. Consider an implementation of pipes from scratch, without using the Linux system call:

struct pipe {
  char buf[1024]; //arbitrary length
  unsigned r, w; //read and write pointers that indicate where in the buffer we should read or write from
};

Assume that the buffer is circular, so that when a read or write pointer hits the end of the buffer it wraps around to the beginning. If r and w were used directly as indexes into the buffer (with values from 0 to 1023), we couldn't tell an empty pipe from a full one, because both would look like r == w. So instead we let r and w count the total number of bytes ever read and written, with values ranging from 0 to UINT_MAX, and index into the buffer with r % 1024 and w % 1024. The pipe is then empty when w == r and full when w - r == 1024. The following expression tells us how many bytes are currently in the buffer:

w - r

Now, we have to create primitives that can access the pipe. To keep it simple, we force reads and writes to be 1 byte at a time.

bool write_pipe(struct pipe *p, char c) {
  if (p->w - p->r == 1024) //check to make sure the pipe isn't full.
    return false;
  //will increment write pointer and mod by 1024 so we can index the buf appropriately
  p->buf[p->w++%1024] = c;
  return true;
} 
int read_pipe(struct pipe *p) {
  //if the pipe is empty, return an int value that a char can't represent so the caller knows it's an error.
  if (p->w == p->r)
    return CHAR_MIN - 1;
  return p->buf[p->r++%1024];
}

On a threaded system, though, concurrency introduces the problem of scheduling reads and writes. Consider what happens when Thread A is writing to the pipe at the same time that Thread B is reading from it. We can fix this by using spinlocks, so that if one thread has the lock, the other threads will have to wait to edit the pipe until that thread releases the lock. The other threads will keep chewing up CPU time until one is allowed to grab the lock. We must add a lock field to our pipe struct, since each pipe has its own lock that controls when it may be modified.

struct pipe {
  char buf[1024];
  unsigned r, w;
  lock_t l;
};
bool write_pipe(struct pipe *p, char c) {
  lock(&p->l);
  bool full = p->w - p->r == 1024;
  if (!full)
     p->buf[p->w++%1024] = c;
  unlock(&p->l);
  return !full;
}

int read_pipe(struct pipe *p) {
  lock(&p->l);
  int c;
  if (p->r == p->w)           //empty pipe: nothing to read
    c = CHAR_MIN - 1;         //error value that no char can take
  else
    c = p->buf[p->r++%1024];
  unlock(&p->l);
  return c;
}

But how do we actually implement locks?

Lock Implementation

We can implement locks as either system calls or as plain machine code.

System Calls

Each time you want to acquire or release a lock, you trap into the kernel, which figures out what other threads are interested in that same lock and arbitrates. It will let you release your lock and grant it to another thread that wants it, and it will let you acquire a lock if possible.

PROS: Easy to implement

CONS: Slow because of the numerous times we have to switch between user and kernel mode
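
To make the idea concrete, here is a rough sketch of the user-level wrappers; sys_acquire_lock and sys_release_lock are hypothetical system calls (not real Linux calls), and lock_t is the lock type used later in these notes:

void lock(lock_t *l) {
  sys_acquire_lock(l);  //trap into the kernel; it blocks us until the lock is free
}

void unlock(lock_t *l) {
  sys_release_lock(l);  //trap again; the kernel hands the lock to another waiting thread
}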

Regular Code

We can implement locks using simple locking and unlocking methods:

//precondition: we don't have a lock
//postcondition: we have a lock
//convention: l = 0 is unlocked, and l = 1 is locked
typedef char lock_t;
void lock (lock_t *l) {
  while (*l)
    continue;
  *l = 1;
  return;
}

//precondition: we have the lock
//postcondition: we don't have the lock
void unlock (lock_t *l) {
  *l = 0;
}

But we now have to worry about races, because this is just ordinary code that any thread can execute in user mode. There's a race condition right after the while ends: some other thread might come out of the loop at the same exact time we do, and both will grab the lock and set l to 1, so both threads end up inside the critical section at once and things go wrong from there. We get out of this by asking the hardware people to help us out.

PROS: Faster than using a system call, since we never have to switch between user and kernel mode.

CONS: Harder to get right, because of the race condition described above.

Hardware Answers - Caching

When accessing memory (RAM), there is actually a cache that sits between the CPU and the RAM, and because the cache is around 100x faster than RAM, it speeds up memory accesses considerably. When the CPU needs to load a value, it gets it from the cache rather than from RAM; if it's not in the cache, it fetches it from RAM. Cache lines are 32 bytes wide, so a typical load at the hardware level will bring in 32 bytes at a time.

This creates a problem. Setting the value of the lock only flips a single byte to 0 or 1 in the cache; the cache line is written out to RAM at some later point. But what if a thread alters the value of the lock while the updated value sits in one core's cache and not yet in RAM? Cores cannot see each other's caches, so when another core checks RAM, it may not see the most recent value. Thus, we must have cache coherence.


Cache Coherence

Coherence means that when one cache changes something, the other caches notice. CPUs must snoop the bus for CPU-RAM traffic and invalidate their own cache lines each time they see another core writing to an area of memory that they have cached. The invalidated data then has to be re-fetched from RAM, which is slower, but this only happens some of the time. Ideally, we want a single instruction that can both assign the lock and handle arbitration of the lock at the machine level.

Spin Locks

If you look at the machine code generated for operations on a 128-bit integer, you'll notice that the machine's registers are only 64 or 32 bits wide (depending on the machine), so gcc implements each 128-bit operation with a pair of instructions.

Example: Suppose x points to a 128-bit integer and we run the following:

*x = 0;

Looking at the generated machine instructions, we see something like:

movq
movq

That is two 64-bit store instructions. This would cause a problem for our implementation of unlock: another thread can sneak in between the two instructions and see a half-written value. Thus, the type we use for a spin lock should not be too large.
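
For reference, a sketch of the scenario above, assuming GCC's unsigned __int128 extension on x86-64 (the function name is just illustrative):

void clear128(unsigned __int128 *x) {
  *x = 0; //typically emitted as two 64-bit movq stores, so another thread can
          //observe a half-written value between the two instructions
}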

Bit Fields

Spin locks should also not be too small. Contrary to popular belief, char is not the smallest variable type in C. You can actually declare a smaller one, like this:

struct pipe {
  unsigned char lock:1; //this is called a bit field
};

The way this is implemented is that the compiler picks out a single bit in its next memory location and says, “That’s the bit that corresponds to my lock field.”

Whenever you access this bit, it loads the corresponding byte, shifts it the correct amount and spits out the correct bit. Whenever you store the bit, it loads the value of the byte, changes the bit to what you wanted, and then does a corresponding store. But can you see a problem? There can be a race condition from the time you load the byte to the time you store the byte. Thus bit fields are also bad for this.
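
A rough sketch of what the compiler effectively generates for an assignment like p->lock = 1 (the exact bit position and masking are up to the compiler; this is simplified):

void set_lock_bit(struct pipe *p) {
  unsigned char byte = *(unsigned char *)p; //load the whole byte containing the bit field
  byte |= 1;                                //set just the lock bit
  *(unsigned char *)p = byte;               //store the byte back; a concurrent update to a
                                            //neighboring bit in the same byte can be lost here
}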

Spin Lock Sizes

How big should you make your spin lock? It depends on the hardware. On the x86 and x86-64, int works: 32-bit accesses are atomic as long as they're aligned, meaning the integers sit at addresses that are multiples of 4.

How are we going to get locks to work?

incl

We need extra instructions for atomic access for “int”. The x86 and x86-64 have the following instruction (x is the address somewhere in memory):

lock incl x

Note: lock is an instruction prefix, so this is still a single instruction. All it means is “grab x’s value out of memory, add 1, and store the result. This is all done atomically.”

The cache coherency hardware knows about this instruction. If a CPU is executing it, it sends out a warning across the bus saying that it's going to add 1 to this location, so nothing else can access this piece of memory while the change is happening. This slows things down, but it is reliable: if 2 CPUs execute this instruction at the same time, 2 will be added to that location.
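
In C you would normally reach this instruction through a compiler builtin rather than hand-written assembly. Here is a sketch assuming GCC, whose __sync_fetch_and_add compiles to a lock-prefixed add on x86 and x86-64 (hits and count_hit are illustrative names):

int hits; //shared counter

void count_hit(void) {
  __sync_fetch_and_add(&hits, 1); //atomic read-modify-write: if two CPUs run this at
                                  //the same time, hits really does go up by 2
}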

But can we implement locks based on this instruction? Technically yes, but it's a big hassle and it isn't expressive enough for general-purpose locking.

xchgl

Another instruction we can use is the following (r being a register and x being a memory location):

xchgl r,x

It simply moves the contents of x into the register and the contents of the register into x. It does this atomically. If 2 different cores execute this at about the same time, it will act as if one happened first, and the other happened second. It won’t step on the other’s toes. Let's look at the following function:

int xchgl(int* p, int val) {
  int old = *p;
  *p = val;
  return old;
}

Note: This isn't actually how the instruction is implemented. In real code, xchgl would be a single hardware instruction (or an inline-assembly wrapper around it), and the exchange happens atomically.

Using this function, we can write the following locking function:

void lock(int* l) {
  //xchgl stores 1 into the lock and returns the old value; if the old value was 1,
  //someone else already holds the lock, so keep spinning until it comes back 0
  while (xchgl(l, 1))
    continue;
}

Thus xchgl allows for correct locking functionality.
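
For completeness, a sketch of the matching unlock under the same convention (0 = unlocked), relying on the fact noted above that aligned int accesses are atomic on x86 and x86-64:

void unlock(int* l) {
  *l = 0; //an aligned 32-bit store is atomic, so clearing the lock releases it
}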

So if we wanted to write code that set x = f(x), assuming f(x) is a simple function, we could also use xchgl. This is seen through the following code:

int old = x;
int new = f(x);
while(xchgl(&x,new) != old)
  continue;

But there's a major bug in this code. After old is loaded with the value of x, another thread could change x before we invoke xchgl. Even so, the xchgl still stores new into x, even though our old value is now stale. And when the code discovers that old is wrong, it keeps looping, waiting for someone else to change x back to the old value.

Compare-and-Swap

In this scenario we have another instruction: compare-and-swap.

cas

The following is its C equivalent:

bool cas(int* p, int old, int new) {
  if (*p == old) {
    *p = new;
    return true;
  }
  return false;
}

This is a single instruction, so you can execute it and it will work atomically across multiple cores. We can also implement our x = f(x) primitive as follows:

while (1) {
  int old = x;
  int new = f(old); //compute the new value from the snapshot we just read
  if (cas(&x, old, new))
    break;
}

Note: Keep in mind, this code doesn't care if x is changed by another thread in the meantime; if the cas fails, it just loops and tries again.
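
If you want to try this out, you don't need to write cas yourself. Here is a sketch assuming GCC, whose __sync_bool_compare_and_swap builtin compiles to a lock-prefixed cmpxchg on x86 and x86-64 (x, f, and update are illustrative names):

int x; //the shared variable we want to update as x = f(x)

void update(int (*f)(int)) {
  while (1) {
    int old = x;
    int new = f(old);
    if (__sync_bool_compare_and_swap(&x, old, new)) //atomically: if x still equals old,
      break;                                        //store new and report success
  }
}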

Locking with Pipes

 struct pipe{
  char buf[1024]; // data
  lock_t l;
};

There is no lock on the data fields of the pipe structure themselves; instead, access to the data is protected by access to the lock. To manipulate the lock itself, you use the machine instructions above, such as compare-and-swap.

Currently, read_pipe and write_pipe work by first acquiring the lock, then running that thread's code, and then unlocking. The code that goes between the lock and unlock is the critical section. Suppose we have multiple threads trying to access the same pipe: in general, if N threads are trying to access the same pipe, then N-1 of them will spin. When a thread is spinning, its CPU stays busy because it is constantly issuing machine instructions.

In order to save memory, suppose we change the implementation of the pipe to the following:

 struct pipe{
  char buf[1024]; // data
};
lock_t global_lock;

The read_pipe and write_pipe functions work the same way as before; the only difference with this new implementation of the pipe struct is that the lock and unlock calls use the global_lock. The critical section still goes between the lock and unlock calls, just as in the original implementation. This approach of using a single global lock for all pipes is called coarse-grained locking.
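
As an illustration, here is what write_pipe might look like under coarse-grained locking, assuming the struct keeps the r and w fields from the earlier version:

bool write_pipe(struct pipe *p, char c) {
  lock(&global_lock);  //one lock serializes access to every pipe in the system
  bool full = p->w - p->r == 1024;
  if (!full)
    p->buf[p->w++%1024] = c;
  unlock(&global_lock);
  return !full;
}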

  • Advantage: Since there is only one lock for all pipes, this method is very easy to understand and implement.
  • Disadvantage: The global lock hurts performance and becomes a bottleneck: anybody currently working on any pipe prevents everyone else from accessing every other pipe as well.

The earlier pipe struct, where each pipe has its own lock, is known as fine-grained locking. Using this method makes the code more complex, but there are fewer bottlenecks.

Now consider a new pipe struct that has two locks; one for read access and one for write access.

 struct pipe{
  char buf[1024]; // data
  unsigned r, w;
  lock_t rl; // read lock
  lock_t wl; // write lock
};

In this case, read_pipe adds 1 to the r pointer and write_pipe adds 1 to the w pointer. In theory a reader of the pipe will not interfere with any writers to the pipe, so you could read and write a pipe at the same time. However, there is an error in this reasoning: both read_pipe and write_pipe need to look at both the r and w values, so using two locks in each pipe struct will not work. If you are able to partition your data structure into two independent subcomponents, then this method might make sense, but in general using two locks in a pipe struct is not good coding practice. Overall, the most typical pattern for using locks is to have one lock per struct.

Now going back to the original pipe struct with one non-global lock.

 struct pipe{
  char buf[1024]; // data
  unsigned r, w;
  lock_t l;
};

One issue with this approach is that waiting on a pipe can use up a lot of CPU time. For example, suppose there are many readers and a few writers. Here is the code that the readers will run:

while ((c = read_pipe(p)) == CHAR_MIN - 1)
  continue;

If there are a lot more readers than writers, the readers will keep polling the pipe even though there is no data to read, unnecessarily using up CPU time. To fix this, we can use yield() as follows:

while ((c = read_pipe(p)) == CHAR_MIN - 1)
  yield();

The yield call tells the CPU to switch to a different thread that can do more useful work. However, even using yield causes some issues. For example, consider a case where there are some writer processes, some reader processes, and then some other processes. A simple scheduling mechanism will run the writers, then the readers, and the other processes last. The problem is that all the extra readers do useless work when the pipe is empty, simply wasting CPU time before the other processes finally get to run. In the next lecture we will discuss how to skip past the unneeded reader processes.