CS 111 Lecture 8

Scribe Notes for 2/04/08

by Chris Swan and John Carter

Synchronization

Synchronization solves the problem of coordinating activities in a shared address space, like the one we have in a threaded system. This can either be a truly parallel system (i.e. a multiprocessor system) or just a multitasking OS.

Processes vs. Threads:

  • Threads share as many resources as possible (this can be good and bad, as we'll see).
  • Processes share as little as possible (maximum isolation, but can be wasteful).

One of the major goals of synchronization is to avoid race conditions - situations in which behavior depends on execution order in an undesirable way. Ordinarily, threads are unsynchronized objects.  They are allowed to execute in parallel with little to no communication between them.  This is fine in some cases, such as when they are only accessing their own sets of data. In other cases it may spell disaster.

We want to accomplish the following:

  • Correctness - Our system should behave properly in all cases.
  • Efficiency and Performance - Our system should be fast and efficient in terms of utilization and wait time.  A minimum of time should be spent waiting for other threads.
  • Clarity/Simplicity/Elegance - Our system should be easy to implement and use.

The following figure shows a situation that might benefit from synchronization:

[Figure: several threads within a process accessing the same shared resource]

In this diagram, several threads within a process are trying to access the same resource.  Depending on the order and nature of their accesses, the behavior of this unsynchronized system could be unpredictable.

When problems begin to surface due to the unsynchronized nature of threads, several techniques may be employed to alleviate the anomalous behavior. The following example illustrates the core idea of this type of problem before those solutions are outlined:

Example: Simultaneous Bank Withdrawals

Suppose we're designing a multithreaded bank ATM system that manages accounts. Accounts have a balance, which cannot be negative, and can be deposited to or withdrawn from at any time from any ATM. The simplest possible implementation of this functionality would look something like this:

int balance; //the shared account balance

void deposit(unsigned int amt) { balance += amt; }
void withdraw(unsigned int amt) { balance -= amt; }

Of course, the withdrawal code should make sure the account has enough money and should be changed to:

void withdraw(unsigned int amt) {
	int new_balance = balance - amt;
	if(new_balance < 0) //insufficient funds: refuse the withdrawal
		return;
	balance = new_balance;
}

Now let's have two threads executing these primitives at the same time. Paul and his brother Kurt happen to share an account with a balance of $100:

[Figure: Paul and Kurt withdrawing from the shared account at two different ATMs]

balance = 100;
Paul: withdraw(100); //these two operations are performed
Kurt: withdraw(100); //at different ATMs at the same time

Pretend for a moment that we're using the version of withdraw() without error-checking, as it's irrelevant right now. Both Paul and Kurt's threads execute assembly code resembling the following:

withdraw(d){
	movl balance , %eax   # load the current balance into a register
	subl d , %eax         # subtract the withdrawal amount
	movl %eax , balance   # store the result back to memory
}

Paul and Kurt's ATMs both load the current balance, $100, into a register. They both subtract $100 from that value, and store the resulting value, $0, back into memory as the new balance. Both ATMs dispense $100 and the current balance is now $0. Everything worked perfectly!

...or did it? The bank has been defrauded of $100 because access to the shared balance variable was not protected!  It's tempting to fix this problem with a global "busy" flag that's checked before manipulating balance, but that fails for the same reason -- both threads would check the value of the flag at the same time, then continue on as before.
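
Concretely, the broken busy-flag idea might look like this (an illustrative sketch, not code from the lecture; it reuses the global balance from before):

int busy = 0; //hypothetical "balance is in use" flag

void withdraw(unsigned int amt){
	while (busy)    //both threads can observe busy == 0 here...
		continue;
	busy = 1;       //...before either sets it, so both proceed
	balance -= amt; //into the "protected" code at the same time
	busy = 0;
}

Before we attempt to solve this problem, let's look at the theory behind it.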

Some theory and definitions

We can keep a multi-threaded environment running properly in two different ways:
  • Sequence coordination - task X must happen before task Y
  • Isolation - task X and task Y do not interfere with each other

[Figure: the execution paths of two threads colliding]

This figure illustrates the execution paths of two threads colliding, which can lead to problems.

Ideally, the operations of two threads never collide (operate on the same data) and the system is therefore fully isolated. This situation is called "embarrassing parallelism". In systems that are not embarrassingly parallel, full isolation isn't possible, so we must rely on sequence coordination. This is accomplished through atomicity. An atomic operation enforces coordination: task X is done either all before or all after task Y, as a single unit.  That is, a task is atomic if it is computationally indivisible.

Caution!

In C, one should be careful with statements that appear to be indivisible, such as balance += amt;. This statement is not atomic, because the compiler implements it as a series of load, add, and store instructions, and the thread can be interrupted between any of them.
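
As an illustrative schedule (assumed for illustration, not drawn in lecture), suppose two threads each execute balance += 50 while balance is initially 100:

Thread A: load balance          -> 100
Thread B: load balance          -> 100
Thread A: add 50                -> 150
Thread B: add 50                -> 150
Thread A: store 150 to balance
Thread B: store 150 to balance  (one deposit is lost; balance should be 200)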

Some key metrics used to classify the properties of a synchronized system are observability and serializability.

  • Observability is a measure of what a thread or other observers in the same context can "see".  If a certain resource is out of the scope of the thread, we need not worry about shared accesses to it.

  • Serializability states that the implementation is correct if every observation is consistent with some purely sequential execution of actions. One of the goals of serializability is to make statements about the correctness of an implementation using a simple justification of observed behavior.  Serializability can also improve performance if the implementation is allowed to internally reorder operations for efficiency (this is done in most modern CPUs and compilers).  An example follows below.
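
For example (an illustrative scenario, not one given in lecture), suppose the balance is $50 and two threads concurrently run withdraw(100) and deposit(100), using the error-checking withdraw() from earlier. A serializable implementation may produce either serial outcome, but nothing else:

Order 1: withdraw(100) fails (insufficient funds), then deposit(100) runs -> balance = 150
Order 2: deposit(100) runs, then withdraw(100) succeeds                   -> balance = 50

Both results are "correct" because each matches some sequential execution; what serializability rules out is an outcome matching neither order, like the lost update in the ATM example.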

We implement atomic operations in software through critical sections. A critical section is a set of instructions executed by a thread such that atomicity is preserved if at most one thread's instruction pointer is in that section at any given time. In order to make critical sections work, this constraint must be enforced, which raises two subproblems: mutual exclusion and bounded wait.

First we'll look at mutual exclusion. Mutual exclusion means that only one thread has access to a certain resource at a given time. We might implement that through locking.

typedef int lock_t;
lock_t l;

void lock(lock_t *p);
void unlock(lock_t *p);

If a thread calls lock(), it will wait until no other thread holds the lock, and then it will take the lock. unlock() releases the lock. Both of these operations must be atomic. We could use the lock like this:

void deposit(unsigned int amt){
	lock(&l);
	balance += amt;
	unlock(&l);
}
The balance += amt operation is now considered a critical section, because even though it takes multiple machine instructions to perform, no other threads can mess things up due to the lock. If all other functions that touch balance are similarly surrounded by lock() and unlock(), the value of balance is now safe from race conditions (no more free money).
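
For instance, withdraw() would be wrapped the same way; here is a sketch combining the lock with the error-checking version from earlier:

void withdraw(unsigned int amt){
	lock(&l);
	int new_balance = balance - amt;
	if(new_balance >= 0)  //only update when funds are sufficient
		balance = new_balance;
	unlock(&l);           //always release the lock, even on failure
}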

Of course, we still have to implement lock() and unlock(). Here's a naive implementation:

void lock(lock_t *p){
	while(*p) continue; //wait until the lock appears free
	*p = 1;             //then take it
}
void unlock(lock_t *p){
	*p = 0;             //release
}
This looks great! Too bad it doesn't work. Imagine two threads attempting to take the lock at the same time: both can pass the while test before either sets *p to 1, so we have the same race condition problem as before. To make lock() work, we need hardware support in the form of a new machine instruction. This is usually implemented in the form of the assembly instruction xchg, which atomically swaps the values of two registers or memory locations. It can be used to implement a function called test-and-set that provides the following behavior:

int tas(int *p, int new){
	int old = *p; //in the real instruction, these three
	*p = new;     //steps happen as one indivisible unit
	return old;
}
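
As plain C, the function above only describes the behavior; the real implementation must be a single atomic instruction. On x86 with GCC it might be written with inline assembly, roughly like this (a sketch under assumed GCC extended-asm syntax; the lecture did not show this code):

int tas(int *p, int new){
	//xchg with a memory operand is atomic on x86; the "memory"
	//clobber keeps the compiler from reordering other accesses
	//around the swap
	asm volatile("xchgl %0, %1"
	             : "+r"(new), "+m"(*p)
	             :
	             : "memory");
	return new; //now holds the old value of *p
}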
Because it actually uses the xchg instruction, which is atomic, test and set is also atomic. Using it, we can re-implement lock() like so:
void lock(lock_t *p){ while(tas(p,1) == 1) continue; }
This is called a busy wait, or spinlock, because the thread wastes CPU cycles "spinning" until it gets the lock. Now our deposit and withdraw functions, armed with locks, are safe to use in a multithreaded environment. One thing to note, however, is that this magical xchg instruction comes at a cost: it is rather slow and computationally expensive, because it must take special care in coordinating bus activity to ensure that no other CPU interrupts it.

You should also note that this type of scheme relies on the assumption of read-write coherence. As long as reading and writing an integer is atomic, we're in business.  However, this assumption is often false.  On the x86 architecture, loads and stores of 32-bit quantities aligned on 4-byte boundaries are guaranteed to be atomic.  For data structures of larger or smaller size, such as large arrays or bit-fields, atomicity is not guaranteed.

Imagine the following example:

struct s{ int foo[1000]; }; //large structure

struct s a, b, c;

a = b;
c = a; //might end up with a mixture of a and b, depending on
       //the order of loads and stores

Similarly, for smaller objects:

struct s2{
	unsigned int a:1;
	unsigned int b:1;
	...
};

struct s2 v;

v.a = 1; //this converts into load, OR, and store operations,
         //so it's not atomic

That's not the whole story, though: we've neglected the earlier-mentioned problem of bounded wait. Imagine that Paul wanted to transfer some money by depositing money into the shared account for Kurt to withdraw. Kurt's a little impatient, and attempts to withdraw before Paul can deposit anything. When that request fails, he tries again. Eventually, he's just mashing the "withdraw" button on his ATM, waiting for the cash. When Paul finally submits his deposit, the lock is taken by the withdrawal thread. As soon as the lock is released, Kurt starts up another withdrawal, and that thread is now competing with the deposit to re-acquire the lock. It's very possible that the withdrawal threads will continue to beat the deposit to the punch.

The solution is to have some sort of queue that ensures threads are able to acquire locks in the order that they request them. If we implement this properly, we can also get rid of the busy-wait problem that spinlocks suffer from.

The standard solution to these problems is what is known as a blocking mutex ("mutex" being a portmanteau of "mutual exclusion").  A blocking mutex has a spinlock that guards access to the mutex itself, plus a wait queue containing the list of threads that are waiting to resume.  Building upon the previous lock_t spinlock, we introduce the following:

typedef struct {
	lock_t l;             //spinlock protecting the mutex itself
	proc_t *blocked_list; //waiting processes
} bmutex_t;

//API:

void acquire(bmutex_t *b);
void release(bmutex_t *b);
void notify(bmutex_t *b);    //notify one waiter that the mutex has
                             //become available, FIFO (fairness)
void notifyAll(bmutex_t *b); //notify all waiting threads,
                             //for user-defined priority

void acquire(bmutex_t *b){
	lock(&b->l);
	//unlock quickly
	...
	unlock(&b->l);
}
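
The lecture stopped short of defining the remaining operations. As one possible sketch (an assumption-laden illustration, not the lecture's implementation): suppose wake(p) is a hypothetical scheduler call that makes process p runnable again, and suppose proc_t carries a next pointer linking the queue; neither is part of the lecture's API.

void release(bmutex_t *b){
	lock(&b->l);                     //briefly take the internal spinlock
	if (b->blocked_list != NULL){
		proc_t *p = b->blocked_list; //wake the first waiter: FIFO fairness
		b->blocked_list = p->next;   //assumed link field in proc_t
		wake(p);                     //hypothetical scheduler primitive
	}
	unlock(&b->l);
}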
Although not exhaustively defined in these notes, the blocking mutex is a step forward in addressing the wasted CPU cycles of the spinlock, and its strengths and weaknesses will continue to be explored in the lectures to come.