CS111 Scribe Notes

Synchronization Primitives - Lecture 4/28

By: Andrew Look, Andrew Castner, Moises Lee

acastner@cs.ucla.edu

alook@ucla.edu

Synchronization Mechanism

Problems:

Race Conditions
Read-Write coherence
Large Objects
- T1: strcpy (buf, “Hello)
- T2: strcpy (buf, “world”)
- T1 can start writing Hel and then stop
- Then T2 will start writing wo which will over write He and make it wol
- In the end we might end up with something like worlo
- -->multiple stores to set object

Small Objects

Struct{
	unsigned int b1:1;
	unsigned int b2:1;
	char buf [10];
} s;

T1: s.b1= 0. 
T2: s.b2 = 0;

Assume that beforehand the bits in the word containing s are all 1's.

Because each assignment is carried out as a [load word], [assign value], [store word] series of instructions (in spite of the fact that the assignments affect individual bits), it's possible that one thread will load the word holding s (currently 0xFFFF) into a register, will be preempted by the second thread which will also load the word into memory (also 0xFFFF), perform its assignment (s.b2 = 0) and store the word (now 0xFF7F, for instance) back into memory.

Thread 1 will then continue, performing its assignment (s.b1 = 0), and then store the result (0xFFDF in this case) at the same spot that thread 2 stored its result. Thus we're left with only thread 1's assignment reflected in memory, which may not have been the desired result.

On a multiple core machine only one of the bits will be cleared, since the machine instruction operates at only 1 bit at a time
The actual load, stores, are on word boundaries
From this point we assume read-write coherence for 32-bit aligned loads and stores.

Example of unaligned load on an x86: load word (4 bits) from location 0x1001

-->This will perform an unaligned load in which the system will fetch 3 bytes from one word and one bit from the word at 0x1004. Since these two loads aren't simultaneous it's possible that another thread could write to one of the words before we've read it and thus leave us with a value that's incoherent.
We can accomplish synchronization with just this, but it's extremely difficult, so we employ a special machine instruction to help us:

	xchg %eax, (%ebx)

The 'xchg' assembly instruction exchanges the values the value in eax with the value at the memory location contained in ebx atomically (i.e. other instructions are executed either before these values have been touched or after they've been swapped. The instruction is known as “test-and-set” in IBM mainframes.

To model this instruction in C:

	int tas (int *p, int new){
		int old =  *p;
		*p = new;
		return old;
	}

All of this is done atomically with just one xchg instruction.

We can use this synchronizing mechanism to construct spinlocks with the following code:

	void lock (mutex_t *m) {  	// M needs to be a multiple of 4
		while (tas (m,1) == 1)  // Spin lock, ties the cpu and shouldnt be done on something 			
			continue;           // Causes a "busy wait"
	} 
  
	void unlock (mutex_t *m){
		*m=0;
	}
	// As a sidenote, m should point to an aligned address 
	//   to insure that loads and stores to it are atomic.

We can see the use of these locks by developing an implementation of Unix's pipe:

 
	enum {N = 1024};
	
	typedef struct {
		char buf[N];
		size_t r,w; 
	} pipe_t;

The next two functions assume there are no synchronization issues:

 
	void write(pipe_t *p, char c) {
		p-> buf [p->w++%N] = c;
	}
	// If w++ overflows the program will still work 
	//	 because size_t is unsigned and can hold a power of two  
	//	 larger than 1024. Thus it will roll over to 0 when w % N would.
	
	char read (pipe_t *p) {
		return p->buf[p->r++ %N];
	}

Here we attempt to fix the race condition the wrong way:

 
	void write(pipe_t *p, char c) {
		// While we're loading, someone else is writing
		while(p->w – p->r == N)  
			continue; 	
		p-> buf [p->w++%N] = c;
    } 
	
	char read (pipe_t *p) {
		while(p->w – p->r == 0)  
			continue;
		return p->buf[p->r++ %N];
    }

Use one global mutex for pipes. This is a coarse-grain lock.

	mutex_t m; 	
	void write (pipe_t *p, char c){
		lock(&m);
		while  (p->w – p->r != N){
			unlock (&m);
			lock (&m);
		}
		p->buf [p->w++%N]=c;
		unlock (&m);
	}

In order to avoid both locking and unlocking twice we can place the above code in a do-while loop:

	bool nonfull;
	void write (pipe_t *p, char c){
		do{
			lock(&m);
			nonfull = p->w – p->r != N
			if (nonfull)
				p->buf [p->w++%N]=c;
			unlock (&m);
		}while (!nonfull);
	}
	// The do-while read() primitive is left as an exercise to the reader

Problems with the above implementation:

(spinlock)*: we're spinlocking over and over, thus wasting CPU cycles busywaiting.

Using one global mutex doesn't scale well, as it limits access to all pipes (even if we have 100,000 pipes we can only access one at a time)

		To fix this we can put the variable mutex_t in the pipe struct, thus allowing each 				
		pipe to be accessed by only one thread at a time.  While more complex than the 				
		coarse-grained lock we employed earlier, this fine-grained lock will scale much better.

We can make an even more fine-grain lock by adding a separate read lock and a write lock:

    mutex_t rm;  	// Read 
    mutex_t wm;  	// Write

Can set up access such that any number of threads can read at once, but only one thread may write at any given time and will block readers out until that thread is done writing:

	typedef struct {
		mutex_t m;
		uint32_t readers;
	} rwmutex_t;
	
	void lock_for_read (rwmutex_t *p) {
		lock (&p->m);
		p-> readers ++; 	   //assumes no overflow
		unlock (&p->m);
	}

	void lock_for_write (rwmutex_t *p){
		for (;;){
			lock (&p->m);
			if (!p-> readers)
				return;
			unlock (&p->m);
			}
		}
	}	

	void unlock_for_read (rwmutex_t *p){
		lock (&p->m);
		p->readers--;
		unlock (&p->m);
	}

	void unlock_for_write (rwmutex_t *p){
		unlock (&p->m);
	}

General rule of thumb for finding critical sections

Look for problems:

Writes to shared state
Reads of nonword-objects (nonatomic reads)
Dependent reads (using the value of a read to determine a later write)

Spinlocks don't work well if we're going to be doing a lot of waiting, as they tie up the CPU.

We instead use a sempahore:

Controls a bounded number of resources (20 buffers, 1 keyboard, etc.)
Maintain a queue of threads that are waiting for the given resource(s) to become available.

Actual implementation:

An int that tells how many of the resource are available
A pointer to a linked list of process descriptors that are waiting
A spin lock

	Operations on a semaphore:
        prolaag – try to decrease n.  If n is positive we get one.
			       (also called P, acquire, grab, down)
        verhoog – increase
			       (V, release, up)

	// We can add a semaphore to our pipe implementation like so:
	char readc(pipe_t *p){
		P( &p->s ); 
		char c = p->buf[p->r++ % N];
		V(&p->s);
		return c;
	}
	// This only works if the pipe is nonempty, and thus should probably go in a loop

	// What would be more helpful here would be something along the lines of:
	// wait_for(p->w  –  p->r  !=  0);

To achieve something like this we'll need a new synchronization mechanism: condition variables

Condition variables serve as representatives of boolean conditions.

Using these variables requires:

A boolean condition
A blocking mutex to protect the condition
A condition variable to represent the condition

API for condition variables:

	wait(condvar_t *c, bmutex_t *b)
        // Precondition: 	B must be acquired.
        // Action:         Releases b then blocks until 
        //					another thread notifies us of the condition.  
        //                 Reacquires b, then returns

	notify(condvar_t *c)
        // Action:          Tells the OS to wake up one of the threads waiting on this condtion