The Goldilocks Principle says that critical sections, which are central to synchronization, should be neither too large nor too small: too large and we needlessly serialize work, too small and we lose correctness.
A critical section contains shared data or resources that should only be accessed or modified by one thread at a time. To protect a critical section, we use code like so:
lock(obj);
perform delicate actions on obj
unlock(obj);
Let us consider an example where we are transferring money from account A to account B. To ensure the transfer is correct, we do not want anyone performing operations on the accounts while we are moving the money. We can use locks like so:
lock(account A);
remove money out of account A;
unlock(account A);
lock(account B);
add money into account B;
unlock(account B);
However, notice that if a bank auditor takes an audit between unlock(account A) and lock(account B), the total balance in the bank won't add up. This means our critical section is too small! We can solve this issue by moving our locks and unlocks like so, locking both accounts before any transfers are made:
lock(account A);
lock(account B);
remove money out of account A;
add money into account B;
unlock(account B);
unlock(account A);
However, this solution is still not optimal because we use multiple locks. If someone already holds a lock on one of the resources, in this case account A or account B, we may end up waiting indefinitely. Worse, if two transfers acquire the locks in opposite orders (A then B versus B then A), they can deadlock, each holding the lock the other needs: enlarging the critical section trades one problem for another.
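One standard way to avoid that deadlock is to always acquire locks in a fixed global order. Here is a minimal sketch, using the lock/unlock primitives from above and a lock_t type like the one defined later in these notes; the account struct, its fields, and the transfer function are our own illustration, not from the lecture:

// Avoid deadlock by always locking the lower-numbered account first,
// so two concurrent, opposite-direction transfers cannot each hold
// the lock the other needs.
struct account { int id; long balance; lock_t lk; };

void transfer(struct account *a, struct account *b, long amount) {
    struct account *first  = (a->id < b->id) ? a : b;
    struct account *second = (a->id < b->id) ? b : a;
    lock(&first->lk);
    lock(&second->lk);
    a->balance -= amount;
    b->balance += amount;
    unlock(&second->lk);
    unlock(&first->lk);
}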
As demonstrated previously, taking an audit can result in problems if our critical sections are too small and we audit in the middle of a transfer. Under our proposed solution, auditing the entire bank means putting a lock on all accounts. Clearly, this is not optimal, as no transactions can be made during that time. It is also difficult to find a good time to perform the audit: if any transactions are in flight when we request it, we must wait for them to complete and release their locks on the accounts. We could alternatively forcibly boot everyone off of the system, but that would result in unhappy users.
Putting the problem of finding an optimal time aside, one solution that avoids locking all accounts assumes that our file system can take a snapshot. We then perform the audit on the snapshot, guaranteeing that no users will request locks on anything in it (as it does not hold any actual accounts, only their state at a specific time). File system snapshots are fairly advanced to implement, however, which is a problem for another day.
So how do we actually implement synchronization? We will cover just a few of the many approaches in today's lecture, keeping our higher-level goals in mind.
Assumptions
Solutions
Let us assume a sample program involving pipes that we are implementing on our own. We will begin this example by discussing how to implement it on a uniprocessor with multiple threads, then discuss the problems facing us when implementing it on a multiprocessor.
We define a simple pipe with the following code:
struct pipe {
    char buf[1024];
    unsigned r, w;   // total bytes read and written so far
    lock_t lock;     // protects the fields above (lock_t is defined later)
};
To keep things simple, we will implement read and write system calls that handle one byte at a time. These implementations account for the possibility that the buffer is full (or empty). A first pass at read and write is as follows:
bool write_pipe(struct pipe *p, char c) {
    if (p->w - p->r == 1024)   // pipe is full
        return false;
    lock(&p->lock);
    p->buf[p->w++ % 1024] = c;
    unlock(&p->lock);
    return true;
}
int read_pipe(struct pipe *p) {
    if (p->w == p->r)          // pipe is empty
        return CHAR_MIN - 1;   // sentinel outside the range of char
    return p->buf[p->r++ % 1024];
}
Note that there is a problem: in write_pipe, the values of w and r could be modified by another thread after the if statement but before the lock is acquired. To solve this, we can do the following:
bool write_pipe(struct pipe *p, char c) {
    lock(&p->lock);
    bool full = (p->w - p->r == 1024);
    if (!full)
        p->buf[p->w++ % 1024] = c;
    unlock(&p->lock);
    return !full;
}
We can similarly modify read_pipe with locks to ensure r is not being modified between the time we check for an empty pipe and return a value:
int read_pipe(struct pipe *p) {
    lock(&p->lock);
    int c = (p->r != p->w) ? p->buf[p->r++ % 1024] : CHAR_MIN - 1;
    unlock(&p->lock);
    return c;
}
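As a quick sanity check, here is a hypothetical usage sketch (not from the lecture; producer and consumer are our own illustrative names, and we assume limits.h and stdio.h for CHAR_MIN and putchar):

// One thread pushes a string into the pipe; another drains it.
void producer(struct pipe *p, const char *s) {
    while (*s)
        if (write_pipe(p, *s))   // retry the same byte if the pipe was full
            s++;
}

void consumer(struct pipe *p) {
    int c;
    while ((c = read_pipe(p)) != CHAR_MIN - 1)
        putchar(c);              // consume bytes until the pipe runs empty
}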
We can implement lock and unlock either as system calls or directly in machine instructions. Because system calls incur a heavy performance cost, we will implement them with machine instructions.
Spin locks are a particular type of lock in which a thread trying to acquire a resource simply waits in a loop and "spins", constantly checking whether the lock is available. Because spin locks employ this sort of busy waiting, they consume valuable CPU time. However, they are particularly easy to implement and are additionally lightweight (in part due to utilizing machine instructions). In this section, we will formulate the implementation of spin locks, working our way from a software model and adapting it into machine instructions.
Spin locks should not be held for long periods of time: a waiting thread burns CPU cycles the entire time it spins.
We begin with a software-only implementation. For the lock itself we just need a word of memory; on x86 and x86-64 computers, an int is acceptable as long as it is aligned (a char would work as well):

typedef int lock_t;

/* precondition: lock not held by the caller
   postcondition: lock held by the caller */
void lock(lock_t *l) {
    while (*l)
        continue;   // busy-wait ("spin") until the lock looks free
    *l = 1;         // claim the lock
}

/* precondition: lock is held by the caller
   postcondition: lock no longer held by the caller */
void unlock(lock_t *l) {
    *l = 0;
}
However, notice that there is a race condition present in the lock function. So our lock needs a lock which needs a lock... which means we must turn to the hardware level!
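Concretely, here is one interleaving that breaks the software-only lock (an illustrative trace, not from the lecture):

Thread 1: while (*l) ...   // sees *l == 0, falls through
Thread 2: while (*l) ...   // also sees *l == 0, falls through
Thread 1: *l = 1;          // believes it now holds the lock
Thread 2: *l = 1;          // also believes it holds the lock!

Both threads proceed into the critical section at the same time.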
Before we turn to implementing locks via machine instructions, let's discuss how memory access works and what problems we can run into. Loads and stores go through the cache, and the cache transfers data to and from RAM a whole cache line at a time, for example 32 bytes at once.
Even on a uniprocessor, each process does not know how other processes are modifying shared memory. Process A may be trying to acquire a lock on resource X: it checks the lock and sees that it is available, but after the check we suddenly context switch to process B, which acquires the lock on resource X. Upon switching back, process A believes it can acquire the lock even though B has already snatched it away. The takeaway for a uniprocessor is that the check and the claim must happen atomically, as one indivisible operation on the memory location.
In a multiprocessor, since each cache is only visible to its associated CPU, we run into a problem if processes running on separate CPUs want to access the same resource. If a process on CPU A accesses resource X and puts a lock on it, that lock may not be reflected in CPU B's cache, especially if CPU B previously loaded that resource into its cache. This problem results in inconsistent caches, leading CPU B to believe that it can use resource X when it should not be doing so.
As formulated previously, we need machine instructions for atomic accesses. To assist us, modern processors provide instructions such as:
lock incl (mem_addr)
By prepending the lock prefix to the instruction, the CPU asserts a signal over the bus that invalidates the corresponding cache line on other processors, letting us increment atomically while maintaining cache coherency.
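For instance, with GCC-style inline assembly (a sketch assuming an x86 target and a GCC-compatible compiler; atomic_incr is our own name), an atomic increment of a shared counter looks like:

// Atomically increment a shared counter; the lock prefix keeps the
// read-modify-write indivisible across all CPUs.
void atomic_incr(int *counter) {
    asm volatile("lock incl %0"
                 : "+m"(*counter)
                 :
                 : "memory", "cc");
}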
We also use the machine instruction xchgl, which exchanges two values atomically. Its operation is modeled below in C, but the real instruction is atomic in hardware:
// store val into *p and return the original value held by *p
int xchgl(int *p, int val) {
    int old = *p;
    *p = val;
    return old;
}
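On x86 the C model above corresponds to a single instruction. Here is a minimal hardware version, assuming GCC-style inline assembly on x86 (xchgl_hw is our own illustrative name, kept distinct from the model above):

// xchg with a memory operand is automatically locked on x86,
// so no explicit lock prefix is needed.
int xchgl_hw(int *p, int val) {
    asm volatile("xchgl %0, %1"
                 : "+r"(val), "+m"(*p)
                 :
                 : "memory");
    return val;   // val now holds the old contents of *p
}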
Upon trying to acquire a lock, we can use the code below, which atomically checks whether the lock is available. The lock is unavailable while the value stored in l is 1, so we "spin" inside the while loop; when xchgl returns 0, the lock was free, and by storing 1 we have just claimed it.
void lock(int *l) {
    while (xchgl(l, 1))   // old value 0 means the lock was free and is now ours
        continue;
}
While xchgl does enable us to build our spin lock primitive, it is still not enough on its own: it cannot tell us whether a value changed while we were computing with it. We introduce a new instruction, compare and swap (CAS), which lets us check whether the data x we are operating on was modified (and hence our update invalidated) while f(x) was running, without ever acquiring a lock:
// compare and swap: if *p == old, store new into *p
bool cas(int *p, int old, int new) {
    if (*p == old) {
        *p = new;
        return true;
    } else
        return false;
}
int old, new;
do {
    old = x;        // snapshot the current value
    new = f(old);   // compute the update from the snapshot
} while (!cas(&x, old, new));   // retry if x changed in the meantime
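CAS can also rebuild our earlier spin lock. A minimal sketch using the cas model above (lock_cas is our own illustrative name):

// Acquire the lock by atomically swapping 0 -> 1; cas fails, and we
// keep spinning, whenever some other thread already holds the lock.
void lock_cas(int *l) {
    while (!cas(l, 0, 1))
        continue;
}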
Key note: Machine instructions enable us to build bigger building blocks in software such as spin locks, which in turn enable neat things like multi-threading. Our data buffer in the pipe can now be adequately protected by the lock_t member in our pipe struct. We have now seen three different machine instructions, each with varying capabilities for enabling synchronization primitives:
asm("lock incl m");
asm("xchgl r, m");
asm("cas r, m")
It may still be disadvantageous to use a spin lock as a global lock: if we have N threads accessing any pipe, in the worst case N-1 threads spin while the pipe is being written to or read from. If N is large, we end up with a serious performance bottleneck as those N-1 threads chew up valuable CPU time.
This points to a tradeoff between two locking strategies, sketched below:
Coarse-grained locking, e.g. a single global lock
Fine-grained locking, e.g. one lock per pipe
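To make the distinction concrete, a brief sketch (our own illustration, not from the lecture):

// Coarse-grained: one lock serializes access to every pipe in the system.
lock_t global_pipe_lock;

// Fine-grained: each pipe carries its own lock, so operations on
// different pipes can proceed in parallel (the struct we used above).
struct pipe {
    char buf[1024];
    unsigned r, w;
    lock_t lock;
};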
We also do not want to use separate read and write locks, because that approach is neither standard nor easily adaptable. If we have many readers and few writers, we could instead use the following solution:
reader: // this involves cooperative multitasking!
while ((c = read_pipe(&p)) == CHAR_MIN - 1)
    yield();   // so as not to keep eating CPU time while the pipe is empty
do_work();     // reached once we have successfully read a byte
What if, as posited before, we have a small number of writers and a large number of readers waiting on a pipe (useless readers)? It is inefficient for each reader to yield in turn: having each reader attempt to read the pipe, realize it is empty, and then yield to the next process consumes valuable CPU time. We need our scheduler to have some way of skipping over all those readers. Tune in next time to find out how!