Since threads share a memory space, it is quite easy to write buggy code if memory-access precautions are not taken. Neither the system nor the compiler checks how data is being read and written, so race conditions arise whenever one thread reads a value while another is writing to it.
We are given the task of designing a banking system for Bill Gates to use. He can deposit or withdraw money at any time he wishes, so long as he has the proper funds to do so. Here is our initial (and naive) approach:
// Money stored in pennies.
// This should be large enough to encompass Bill's fortune.
long long int balance = 0;

void deposit(long long int amt) {
    balance += amt;
}

bool withdraw(long long int amt) {
    if (amt <= balance) {
        balance -= amt;
        return true;
    }
    return false;
}
On a single thread this code runs fine. However, suppose that Bill tries to deposit $100 (or 10000 pennies) into his account at an ATM while his wife tries to withdraw $50 from another machine.
Depending on the order in which the two threads execute, we can observe unexpected results: if, for example, the deposit thread adds the $100 to the balance while the withdraw thread is removing $50, the deposit can be erased altogether. Looking at the assembly code, we can see that the compiler moves the value of balance into a register, subtracts amt, and stores the result back. The increased value from the deposit thread has been completely blown away.
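To see the lost update concretely, here is a hand-simulated sketch of one bad interleaving, with the two threads' registers modeled as local variables (the starting balance of $50 is an assumption made here so that the withdrawal's funds check passes):

long long int balance = 5000;        // $50, in pennies

void lost_update(void) {
    long long int dep_reg, wd_reg;   // the two threads' "registers"

    dep_reg = balance;    // deposit thread loads 5000
    wd_reg  = balance;    // withdraw thread loads 5000 (funds check passed)
    dep_reg += 10000;     // deposit computes 15000
    balance  = dep_reg;   // deposit stores 15000
    wd_reg  -= 5000;      // withdraw computes 0
    balance  = wd_reg;    // withdraw stores 0: the $100 deposit is gone
}

Each thread's sequence is correct in isolation; only the interleaving of the loads and stores destroys the deposit.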
We can use the volatile keyword to tell the compiler not to optimize accesses to balance and to reload its value before any computation.
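A minimal sketch of that declaration:

// volatile forces the compiler to re-read balance from memory
// before every use, rather than caching it in a register.
volatile long long int balance = 0;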
This does not fix the issue, as the machine cannot perform a computation on a value without first loading it into a register anyway. An interrupt can occur at any moment, even on single-core machines, and another thread can update the value before the first thread has a chance to finish.
One approach to our problem would be to introduce machine instructions that mark the beginning and end of critical sections, or portions of code during which the executing thread must not be interrupted or context-switched away. Several problems arise from this approach, however: such instructions must be privileged, since a buggy or malicious program could otherwise disable interrupts and monopolize the CPU, and on a multicore machine disabling interrupts on one core does nothing to stop the threads running on the other cores.
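For concreteness, on x86 the real instructions cli and sti disable and re-enable interrupts; here is a sketch of deposit using them (this only works in kernel mode, and only protects a single core):

void deposit(long long int amt) {
    asm volatile ("cli");   // begin critical section: disable interrupts
    balance += amt;
    asm volatile ("sti");   // end critical section: re-enable interrupts
}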
We define our own pipes, which will be used internally in our application, unlike the pipes that reside in kernel space and operate between processes. Our pipes will work as circular buffers and are limited to single-byte I/O. Here is our initial implementation:
#include <stdbool.h>

enum { N = 1024 };

// R is the total number of bytes read
// W is the total number of bytes written
// 0 <= R <= W
// r and w are R and W modulo 2^32 (unsigned wraparound)
struct pipe {
    char buf[N];
    unsigned r, w;
};

bool pipe_empty(struct pipe *p) {
    return p->w == p->r;
}

// C black magic ensures that unsigned subtraction,
// when one number has overflowed, will also overflow,
// giving the correct answer.
bool pipe_full(struct pipe *p) {
    return p->w - p->r == N;
}

void writec(struct pipe *p, char c) {
    while (pipe_full(p))
        continue;
    p->buf[p->w++ % N] = c;
}

// readc could (if the pipe is empty) wait or return -1.
// In this implementation we will wait.
char readc(struct pipe *p) {
    while (pipe_empty(p))
        continue;
    return p->buf[p->r++ % N];
}
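For reference, a minimal single-threaded use of this interface:

static struct pipe p;        // zero-initialized: r == w == 0, so empty

int main(void) {
    writec(&p, 'x');
    char c = readc(&p);      // c == 'x'
    return c == 'x' ? 0 : 1;
}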
It is easy to see that this code is vulnerable to race conditions. If, for example, there is a single byte of space left in the pipe and two threads simultaneously decide to write a byte, one write will clobber the other's data. To deal with this, we'll introduce a way of locking code through some primitives, whose implementation we will examine later:
// Rules of engagement:
// 1. It is the caller's responsibility to perform
//    a small task after taking a lock, and to promptly unlock after.
// 2. Do not lock something that's already been locked.
// 3. Do not unlock a resource that hasn't been locked.
// Otherwise BadThings™ will happen.
typedef ... spinlock_t;

void lock(spinlock_t *l);
void unlock(spinlock_t *l);
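As a quick illustration of the interface (a sketch; the name balance_lock is introduced here), wrapping the bank account's read-modify-write sequences in a lock eliminates the earlier race:

spinlock_t balance_lock;     // hypothetical lock guarding balance

void deposit(long long int amt) {
    lock(&balance_lock);
    balance += amt;
    unlock(&balance_lock);
}

bool withdraw(long long int amt) {
    bool ok = false;
    lock(&balance_lock);
    if (amt <= balance) {
        balance -= amt;
        ok = true;
    }
    unlock(&balance_lock);
    return ok;
}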
Now that we have a mechanism for marking critical sections, we can make our code safer to parallelize. Unfortunately, the tools for this task are not sophisticated enough to perform the work for us, so we must look at the code, identify possible race conditions, and mark critical sections by hand.
Much like Goldilocks, we want to find a critical section of just the right size. If it is too large, we violate the rule of bounded wait, and other threads end up waiting too long. If it is too small, it will not encompass the race condition, and it is as if no action was taken to prevent it.
To determine the right size of a critical section, we need to take into account shared reads and writes of data. If every thread only reads a particular piece of data, we need not worry about it spontaneously changing. Writing data introduces race conditions when there are gaps between when it is written and when it is read.
Let's update our code using our spinlock_t type, keeping in mind the size of the critical sections.
spinlock_t l;    // one global lock guarding every pipe

void writec(struct pipe *p, char c) {
    for (;;) {
        lock(&l);
        if (!pipe_full(p))
            break;
        unlock(&l);
    }
    p->buf[p->w++ % N] = c;
    unlock(&l);
}

char readc(struct pipe *p) {
    for (;;) {
        lock(&l);
        if (!pipe_empty(p))
            break;
        unlock(&l);
    }
    char c = p->buf[p->r++ % N];
    unlock(&l);
    return c;
}

bool pipe_empty(struct pipe *p) {
    return p->w == p->r;
}

bool pipe_full(struct pipe *p) {
    return p->w - p->r == N;
}
Say our application now has 1000 pipes and 5000 threads accessing them. Since we have one global lock for reading and writing any pipe, it becomes a bottleneck: one thread runs while the other 4999 wait for it to finish. If we use finer-grained locks rather than one coarse one, we will have more locks to manage, but we gain the benefit of locking less data at a time.
The pipe data structure can be rewritten to hold a lock pertaining only to that pipe.
Note: in C the -> operator binds tighter than the & operator, so &p->l means &(p->l).
struct pipe {
    char buf[N];
    unsigned r, w;
    spinlock_t l;    // per-pipe lock
};

void writec(struct pipe *p, char c) {
    for (;;) {
        lock(&p->l);
        if (!pipe_full(p))
            break;
        unlock(&p->l);
    }
    p->buf[p->w++ % N] = c;
    unlock(&p->l);
}

char readc(struct pipe *p) {
    for (;;) {
        lock(&p->l);
        if (!pipe_empty(p))
            break;
        unlock(&p->l);
    }
    char c = p->buf[p->r++ % N];
    unlock(&p->l);
    return c;
}

bool pipe_empty(struct pipe *p) {
    return p->w == p->r;
}

bool pipe_full(struct pipe *p) {
    return p->w - p->r == N;
}
We still have a problem with our implementation: we waste lots of CPU cycles spinning while waiting for resources to be freed. We can address this by yielding to the kernel (via a system call or a yield() primitive). Using this technique, however, a problem arises if a single thread is performing a large task: it gets slowed down by all the other threads repeatedly waking up, seeing that the resource is still locked, and going back to sleep.
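As a sketch of the yielding variant, here is writec again, with POSIX's sched_yield() standing in for the yield() primitive mentioned above:

#include <sched.h>    // for sched_yield()

void writec(struct pipe *p, char c) {
    for (;;) {
        lock(&p->l);
        if (!pipe_full(p))
            break;
        unlock(&p->l);
        sched_yield();   // give up the CPU instead of spinning hot
    }
    p->buf[p->w++ % N] = c;
    unlock(&p->l);
}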
We can approach this by making our spin locks blocking locks instead; we want to use a blocking mutex. Our new locking interface takes the form:
// The rules for locking are the same as before, except
// now threads are blocked instead of continuously spinning.
// Rules of engagement:
// 1. It is the caller's responsibility to perform
//    a small task after taking a lock, and to promptly unlock after.
// 2. Do not lock something that's already been locked.
// 3. Do not unlock a resource that hasn't been locked.
// Otherwise BadThings™ will happen.
typedef ... bmutex_t;

// Get the lock, but wait (block) if it is already locked.
void acquire(bmutex_t *b);
void release(bmutex_t *b);
// Grab the lock and return true if it is free;
// otherwise return false immediately without blocking.
bool try_acquire(bmutex_t *b);
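For example (a sketch, assuming the pipe now embeds a bmutex_t b in place of its spinlock), try_acquire lets a thread do useful work instead of blocking when the pipe is busy:

struct pipe {
    char buf[N];
    unsigned r, w;
    bmutex_t b;      // hypothetical: replaces the per-pipe spinlock
};

// Attempt one write; returns true on success, false if the pipe
// was locked or full, letting the caller do other work and retry.
bool try_writec(struct pipe *p, char c) {
    bool written = false;
    if (try_acquire(&p->b)) {
        if (!pipe_full(p)) {
            p->buf[p->w++ % N] = c;
            written = true;
        }
        release(&p->b);
    }
    return written;
}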
Typically, spin locks are faster than blocking mutexes, but they are less efficient under contention, since they burn CPU cycles while waiting. The general strategy is to use spin locks when resources are usually free and rarely contended, and blocking mutexes in the converse situation.
How do we actually implement the primitives for spin locks and mutexes? That is a lesson for another time...