Notes by Thomas Anderson and Owen Horwitz
A critical section is a portion of code that should be run sequentially by a thread, without interrupts or context switches.
We will first discuss the importance of understanding critical sections. Suppose we are writing a function for a bank's internal multi-threaded account manager. Presumably we would want a function that can transfer some specified amount of money between two accounts, a function to deposit, and a function to withdraw. We might naively implement the transfer function like so:
void transfer(int* acct1, int* acct2, int amt) { acct1* += amt; acct2* -= amt; }
To the untrained eye this may look like a function that does its intended purpose, but there is a bug. There is a race condition. Suppose Paul Eggert has two accounts, one with $1,000,000 in it and one with $0 in it. Eggert is going to use his knowledge of critical sections to pull off the biggest heist ever committed by a UCLA faculty member:
transfer(Eggert_1, Eggert_2, 1000000);
Immediately after putting in his transfer request he is going to put in two withdraw requests (assume each request runs on a separate thread):
withdraw(Eggert_1,1000000); withdraw(Eggert_2,1000000);
bool withdraw(int* acct, int amt) { if(amt>acct*) return false; acct* -= amt; }
Now, suppose that after running line 3 of the transfer function, an interrupt occurs. At this moment, both of Eggert's accounts will have $1,000,000 in them. If both the withdraw functions are scheduled before the transfer request resumes execution, Eggert will walk away with an extra million in cash. The area where the race condition occurs is called a critical section.
To solve this problem we must first introduce the notion of a lock. A lock is a method that lets other threads know when we don't want them to modify a shared variable. Locks will only work in a cooperative environment. They don't actually do anything to limit access to data, so they only work if all the threads agree to obey the lock. We can define locks as a global variable stored in memory:
typedef int volatile lock_t;
The general idea is to establish a convention for the lock, such as, when the lock is set to one, other threads won't modify a certain area of memory, and when it is set to zero, they are free to lock that area for themselves. To do this we must check some preconditions:
Precondition for unlock -- The thread owns the lock.
Precondition for lock -- Nobody own the lock.
To implement our lock and unlock functions, we need to know about atomic instructions. In the x86 architecture there is support to guarantee some instructions will be completed before an interrupt with the lock prefix. The lock prefix in the x86 ISA applies to instructions that do a read-modify-write on memory, such as INC and XCHG.
int exchange (int *p, int val) { asm("lock xchgl...") } //grab the lock void lock (lock_t *p) { while(exchange(p,1)) //someone else owns the lock continue; } //give up the lock void unlock (lock_t *p) { *p=0; }
These instructions have the benefit of being atomic, but the downside of being slow. When the lock instruction is used, the system has to temporarily block the bus of all the other cores to prevent them from modifying the same area of memory, and clear the cached values for that section of memory. If used often this can greatly reduce system performance.
With this solution, critical sections are a bottleneck. Although they fix the correctness bug, we have introduced a new performance bug. In a dual core system, we will have at most one process waiting on another to get access to a locked object, but as Eggert pointed out, bottlenecks become worse with scaling. Suppose we ran our program on a 16 core system. We could have a 15 cores of the system all waiting on a single critical section. Because of this, we have arrived at two fundamental properties critical section code must abide:
1. Mutual exclusion -- two people can't unlock the same code (obvious)
2. Bounded weight -- We agree to not make other threads wait forever (ex. infinite loop). Longer code means longer wait, so we will KISS (keep it stupid simple).
With the second fundamental property, we want to minimize the amount of time we spend waiting on a lock. A simple fix to reduce the wait time is to only do atomic instructions within a critical section. We can do our read-modify-write instructions atomically provided that they follow three rules:
1. The object being accessed isn't too large -- We need the object to fit in a word, and have an ISA instruction that supports its access, so that we only need to modify one register.
2. The object being accessed isn't too small -- If the data type is smaller than the processor's smallest addressable unit, or the ISA's, the CPU will need to do additional arithmetic with shifts, boolean operations, and stores to access the object.
3. The object being accessed is aligned -- The data we are accessing could be of an appropriate size, but if it is misaligned, and either the ISA, or the processor, doesn't support unaligned access, it will take multiple reads, writes, and arithmetic operations to access the object.
With our new knowledge on locks we are ready to switch gears and examine how pipes are implemented. As could be expected, a pipe has a circular buffer that can be read from and written to, and pointers to the current read and write locations.
#define N 1024 struct pipe { char buf[N]; size_t r,w; } char read_pipe(struct pipe *p) { while(p->w == p->r) //no data to read continue; return p-> buf[p->r++ % N]; } char write_pipe(struct pipe* p, char c) { while(p->w -- p->r == N) continue; p->buf[p->w++ % N]=0; } //Note: to improve efficiency we could support larger data types than char
Once again there is a race condition. Right when a thread is about to read some data, an interrupt could occur. A second thread could read from the same pipe, and move the read pointer. When the first thread resumes execution it would read the same data a second time, which is not the correct behavior. For this reason, we need to lock the pipe whenever it is being accessed. An incorrect solution is to add a lock to the read and write functions as follows:
char read_pipe(struct pipe *p, lock_t l) { lock(&l); while(p->w==p->r) //no data to read continue; char retval=p-> buf[p->r++ % N]; unlock(&l); return retval; } char write_pipe(struct pipe* p, char c, lock_t l)) { lock(&l); while(p->w-p->r==N) //no space to write continue; p->buf[p->w++ % N]=0; unlock(&l); }
This appears to fix the correctness bug, but in fact it does not. We have violated the second fundamental property for a critical section; the wait is potentially unbounded. When there isn't any data to read, the read function will loop infinitely. When the buffer is full, write will loop infinitely. To fix this bug, we modify the code slightly:
#define N 1024 struct pipe { char buf[N]; size_t r,w; lock_t l; //support multiple pipes being accessed at once } char read_pipe(struct pipe *p) { lock(&p->l) while(p->w == p->r) { unlock(&p->l); yield(); //give control to another process lock(&p-l); } char retval=p->buf[p->r++ % N]; unlock(&p->l); return retval;; } char write_pipe(struct pipe* p) { lock(&p->l); while(p->w-p->r==N) { unlock(&p->l); yield(); lock(&p->l) } p->buf[p->w++ % N]=0; unlock(&l); }
This has fixed our correctness bug, and this implementation is known as a spin lock because each process polls the pipe to see if it is ready. We need something better than having each function yield every time there is a full or empty pipe. Our improved design should make sure a read or write won't get awakened without a good chance the pipe will be ready. We should have a way to keep track of who is waiting for the pipe, and have a strategy for what order these blocked processes run in. To accomplish this we will introduce the blocking mutex:
typedef struct { lock_t l; //spin lock bool locked; thread_t *blocked_list; } bmutex_t; void acquire(bmutex_t* b) { while(lock(&b->l),b->locked) { thread_t* l=b->blocked_list; b->blocked_list=self; //self is the current running thread. The kernel keeps track of this. self->next=l; unlock(&b->l); yield(); } b->locked=true; unlock(&b->l); } void release(bmutex_t* b) { lock(&b->l); b->lock=false; thread_t* next = b->blocklist; //next is the next scheduled thread. The kernel keeps track of this. if(next) { next->runnable=true; b->blocklist=next->next; } unlock(&b->l); yield(infavorofnext); /**infavorofnext is an argument to yield that would get the scheduler to prioritize next. The kernel would have a way to implement this.**/ }
While the implementation is correct, it still has some spinning and doesn't scale well. Imagine we have a pipe with 500 readers and 1 writer. If the writer writes to an empty pipe, 1 reader will wake to read the data, and then 499 more losing threads will wake up, realize there's no data in the pipe, and go back to sleep. To fix this problem, we need to introduce a new type of variable.
We introduce condvar_t to represent some condition that must be met to wake a thread. The condition is a boolean variable. For example, "is pipe empty?" or "is the pipe full?". The API adds two new functions:
//The thread calling wait_for must have aquired the mutex b. //wait_for releases b, waits until c becomes true, then finally reaquires b. wait_for(condvar_t *c, bmutex_t *b) //This function is called by some thread to wake up one thread waiting in wait_for on the same condition variable. notify(condvar_t *c)
Using this new API, we can modify our pipe to read using condition variables.
struct pipe { bmutex_t b; char buf[N]; size_t r, w; condvar_t nonempty; //represents p->w != p->r condvar_t nonfull; //represents p->w - p->r != N } char read_pipe(struct pipe *p) { acquire(&p->b); while(p->r == p->w) wait(&p->nonempty, &p->b); //a write_pipe thread needs to call notify(&p->nonempty) for wait to continue if the pipe is empty char c = p->buf[p->r++%N]; notify(&p->nonfull); release(&b->b); return c; } void write_pipe(struct pipe *p) { ... }
Now, our pipe implementation can read and write to pipes asynchronously in a way that scales well.
With the blocking mutex we have followed a trend of how higher level locks can be built out of lower level ones. We can generalize the concept even further by introducing semaphores. A semaphore works to limit the number of accesses to a resource. In fact, a blocking mutex is known as a binary semaphore because it allows either zero or one accesses to a resource at once. Semaphores help us to manage restraints such as "at most ten people can be logged in."