4/26/2007
by Richard Fillman and Jason Stoops
Imagine actual threads (of execution) walking through memory executing instructions
Threads in totally different parts of memory are fine.
The problems arise at the points of intersection.
Sequence coordination:
Each thread is written as a sequential program: it does each action in order.
Natural consequence of threads being sequential programs.
The next action can assume that the previous action's work is done, and done properly.
Isolation
We want threads to not collide with each other.
When threads "talk" to different parts of memory, isolation is easy.
Collision points are the hard parts.
Two thread actions x & y should never interfere with each other.
Implement isolation using: atomicity
Enforces isolation
An action is either done to completion or not done at all.
From the point of view of an outside observer: never an "in between" state where an action is half done
Read/write coherence
Atomicity being applied to loads and stores.
Read/write coherence occurs when each load and store is either done or not done
No worries about half-loaded and half-stored stuff.
char buf[100];
strcpy(buf, "abcdefghi"); //suppose here, another thread examines buf. It sees "abcdefghi"
strcpy(buf, "wxyzqrstuv");
Is it possible for a thread to run right in the middle of our strcpy such that it sees neither the old nor the new value?
It depends on how strcpy is implemented.
Normally strcpy looks like this:
char *strcpy(char *a, char const *b) {
    char *p = a;
    while (*p++ = *b++)
        continue;
    return a;
}
strcpy is not an atomic operation, it doesn't give us isolation.
x86 has special string-move instructions (the VAX called its version movc3, "move characters").
So we could write something like (pseudo-assembly):
movc3 buf, "abcdefghi"
But even still, this is not atomic.
Underneath, a little micro-engine does the individual loads and stores.
Some other thread on some other CPU could see some strange value in the meantime.
Even on a single CPU machine, these kinds of instructions are still interruptible.
The problem here is that large objects like these simply aren't atomic; you just have to live with it.
Sometimes small objects are not atomic either.
struct s {
    unsigned int a:1;
    unsigned int b:1;
    char buf[100];
};
This is laid out in memory roughly as follows: one byte contains a and b (plus 6 unused bits), followed by the 100 bytes of buf.
So suppose we do this:
struct s v;
v.b = 1;
Is this atomic? NO!
Why?
The machine loads the first word (containing a, b, and the first 3 bytes of buf), sets the bit for b, and stores the whole word back into memory.
Between the load and the store, some other thread might set v.a = 1, and one thread is going to stomp the other's data.
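A sketch of what the compiler effectively does for v.b = 1 (illustrative only; the exact word size and bit position depend on the ABI):
// Read-modify-write, so not atomic:
uint32_t w = *(uint32_t *)&v;   // load the word containing a, b, and part of buf
w |= 1 << 1;                    // set the bit used for b (bit position illustrative)
*(uint32_t *)&v = w;            // store the whole word back
// Another thread's v.a = 1 between the load and the store gets lost.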
What is atomic then?
Look at ABI for your architecture
Assume nothing is atomic unless ABI says otherwise.
x86
Load/store of a 32-bit aligned word
Test and set function
Lock increment long (lock incl x) IS atomic. (Normal increment long is NOT atomic)
(The non-locking version is the default for speed reasons.)
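A sketch of how to get that instruction from C, assuming the GCC/Clang inline-assembly extension and an x86 target:
#include <stdint.h>
// Atomic increment of an aligned 32-bit word via x86 "lock incl"
// (a sketch; assumes the GCC/Clang asm extension).
static inline void atomic_incl(uint32_t *x) {
    asm volatile ("lock incl %0" : "+m" (*x) : : "memory");
}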
In C, compare-and-swap behaves like this:
bool compare_and_swap(uint32_t *p, uint32_t old, uint32_t new) {
    // store the new value into memory if *p == old value
    // this is done by a single instruction and is atomic
    if (*p == old) {
        *p = new;
        return true;
    } else
        return false;
}
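In practice a compiler builtin supplies this atomically; a sketch, assuming GCC or Clang (whose __sync builtins provide it):
#include <stdbool.h>
#include <stdint.h>
// Same interface, backed by the GCC/Clang builtin, which compiles to an
// atomic instruction (lock cmpxchg on x86).
bool compare_and_swap(uint32_t *p, uint32_t old, uint32_t new) {
    return __sync_bool_compare_and_swap(p, old, new);
}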
How to implement lock increment long using compare_and_swap()?
compare_and_swap(&x, x, x+1);
This won't work because the 2nd and 3rd arguments may have been loaded at different times
Let’s try again:
uint32_t x0=x;
compare_and_swap(&x,x0,x0+1);
This way we load x only once, but x could still change in the meantime.
compare_and_swap will return false if that happens
RULE OF OS PROGRAMMING: never call a function that CAN return an error without CHECKING for errors.
So here we go:
uint32_t x0;
do {
    x0 = x;
} while (!compare_and_swap(&x, x0, x0 + 1));
Imagine we're a bank
We want to eliminate race conditions on deposits and withdrawals
Can’t lose money for the bank or customers.
Tons of threads, people using ATMs, hitting the exact same account.
typedef struct {
    uint64_t balance; // balance in pennies
} ba;
// Need 64 bits: 32 bits tops out around 4 billion pennies = $40 million,
// and that's not unreasonable...
void deposit(ba *p, uint64_t amt) {
    p->balance += amt;
}
void withdraw(ba *p, uint64_t amt) {
    if (amt <= p->balance)
        p->balance -= amt;
}
This is all well and good until we try to use it on a multithreaded server; then we have problems.
Start with $1.00
Person A              Person B
deposit(10);          deposit(20);
load 100              load 100        // uh oh, need to load 110...
add to get 110        add to get 120
store 110              store 120
Solution: lock!
Coarse grained lock: protects all deposits and withdrawals, everyone waits on that lock to do anything
Fine grained lock: lock per account.
Rewrite to include mutex:
typedef struct {
    uint64_t balance; // balance in pennies
    mutex_t m;
} ba;
void deposit(ba *p, uint64_t amt) {
    lock(&p->m);
    p->balance += amt;
    unlock(&p->m);
}
void withdraw(ba *p, uint64_t amt) {
    lock(&p->m);
    if (amt <= p->balance)
        p->balance -= amt;
    unlock(&p->m);
}
Rule for placing critical sections (for now)
Look for writers to shared objects
Look for dependent reads before writing.
Generate the data to be written.
Put the write and dependent reads into the critical section.
This assumes that a critical section always consists of several reads and one write.
If multiple writes, then merge critical sections.
If other instructions in the middle of the reads and writes are unrelated, get them out of there (see the sketch after this list).
You want critical sections to be as lean and mean as possible.
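For instance (a sketch; audit_log is a hypothetical helper that never touches the shared balance):
// Applying the rule to withdraw(): only the dependent read (the balance
// check) and the write it guards stay inside the critical section.
// audit_log() is hypothetical and unrelated, so it stays outside the lock.
void withdraw(ba *p, uint64_t amt) {
    audit_log("withdraw requested", amt);   // unrelated work: outside
    lock(&p->m);
    if (amt <= p->balance)                  // dependent read
        p->balance -= amt;                  // the write it guards
    unlock(&p->m);
    audit_log("withdraw finished", amt);    // unrelated work: outside
}
Reads of shared data need the same care; consider this unlocked read: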
uint64_t get_balance(ba *p) {
return p->balance;
}
This could get us into trouble, since another thread might be halfway through a 64-bit store.
(Assuming for purpose of argument that 64 bit loads are two 32 bit loads)
If we're caught in the middle of this action, we'll get some bogus number that is neither the old nor new value.
(This only applies to those of us with more than 40 million dollars but that's beside the point)
uint64_t get_balance(ba *p) {
    lock(&p->m);
    return p->balance;
    unlock(&p->m); // this is REALLY bad because we never unlock!
}
rewrite to fix unlocking problem:
uint64_t get_balance(ba *p) {
    lock(&p->m);
    uint64_t v = p->balance;
    unlock(&p->m);
    return v;
}
Problem: performance bottleneck
This isn't a problem when we only have readers: many simultaneous readers are fine.
One writer with no readers is also fine.
Solution: read/write lock
Allow either one writer or however many readers you want.
typedef struct {
    mutex_t wm;           // write lock
    unsigned int readers; // number of readers
} rwlock_t;
void lock_for_read(rwlock_t *p) {
    lock(&p->wm);    // exclude writers and other updates to the rwlock_t itself
    p->readers++;    // increment reader count (shared variable, so a critical section)
    unlock(&p->wm);  // unlock so other readers can increment and get the lock too
}
void unlock_for_read(rwlock_t *p) {
    lock(&p->wm);
    p->readers--;
    unlock(&p->wm);
}
void lock_for_write(rwlock_t *p) {
    for (;;) {
        lock(&p->wm);      // here we know there are no other WRITERS
        if (!p->readers)   // ...but there may still be readers
            return;        // no readers left: return still holding wm
        unlock(&p->wm);    // let the remaining readers drain, then retry
    }
}
void unlock_for_write(rwlock_t *p){
/* left as exercise for the reader */
}
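With that in place, readers no longer exclude each other for the duration of the read; a sketch, assuming ba is given a hypothetical rwlock_t field named rw:
// Sketch: many readers can hold the read lock at once.
// Assumes ba has a (hypothetical) rwlock_t field named rw.
uint64_t get_balance(ba *p) {
    lock_for_read(&p->rw);
    uint64_t v = p->balance;
    unlock_for_read(&p->rw);
    return v;
}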
Assume extreme politeness and accuracy in locking and unlocking.
Lock precondition: the caller does not already hold the lock.
Unlock precondition: the caller holds the lock.
Go back to day 1, concept of a system, interface, and external world
Things that are observable are things that go through the interface.
Atomicity doesn't refer to atomic actions inside the system
We care about observable atomicity; this is how you "measure" atomicity.
Behavior of a system is correct if all observations are consistent with some sequence of atomic operations.
The idea is as follows:
Initial threads (B is balance, D is deposit, W is withdrawal)
T1 D D W B
T2 W D D
T3 B B W
Balance is the only observable function
We have 10 transactions overall.
3 observations are in there somewhere (the balance calls)
All we care about are that the observations are consistent with some ordering of the transactions
We don’t care what particular order, however.
Furthermore, we don't care what's in the other states, as long as the observations are right.
Important consequence of serializability - internal state need not be obvious or "consistent"
From an OS point of view, observations are system calls.
Internal messes may even be faster, as long as nobody needs to look at them...
BENEVOLENT SIDE EFFECTS are allowed in implementations of atomic operations.
An effect on the state of the system that you like because it improves performance and the user doesn't notice.
Example: cache bank account.
Different sequences of events may leave cache in different states, but the user doesn't see the cache so it’s okay.
Can compare_and_swap implement deposit(p,amt) without a mutex?
Yes! If you have a compare_and_swap of the right size (64 bits in this case)
This is good, leads to performance improvements (mutexes slow things down)...
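A minimal sketch of that, assuming a 64-bit version of the primitive (hypothetically named compare_and_swap64) with the same shape as the 32-bit one above:
// Lock-free deposit: retry until the balance hasn't changed between our
// load and our CAS. compare_and_swap64 is a hypothetical 64-bit analogue
// of compare_and_swap.
void deposit(ba *p, uint64_t amt) {
    uint64_t old;
    do {
        old = p->balance;  // snapshot the current balance
    } while (!compare_and_swap64(&p->balance, old, old + amt));
}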
The scheduling problem:
You have N CPUs and >N threads that are runnable
Which threads should you run?
Other resources to schedule
Memory
Network
IO access
Question of what to schedule depends on what your scarce resource of the moment is.
Policies - rules of thumb you use to hand out resources.
Mechanisms - the code that does that work.
Example:
Airline scheduling!
Resources: planes, crew, passengers, fuel
OS scheduling is easy by comparison.
We have more CPU time than we really know what to do with; disks and network connections are generally idle.
BUT our scheduler has to be way faster.
Scheduling metrics: (how do we know how well we're scheduling?)
Throughput or utilization: (useful time) / (total time)
But think about this:
You’re crunching some image while typing your thesis
High utilization can be achieved just by devoting 100% of time to image crunching
But that's inconvenient when you're trying to work on your thesis and things don't respond.
So, this leads us to another metric: turnaround time
Turnaround time = (time of completion) - (time of request)
Response time = (time user starts to see results) - (time of request)
Waiting time = (time the action starts) - (time of request)
Goal 1 - maximize utilization
Goal 2 - minimize turnaround/response/waiting time (depending on your point of view)
Goal 3 - "fairness" - each competing task gets a "fair" share of CPU
A sample scheduling policy
First come first serve (FCFS or FIFO)
Job   Arrival Time   Workload
A     0              5
B     1              2
C     2              9
D     3              4
Timeline |--A--||B||----C----||-D-|
In between jobs we have context switches, however, so the whole run takes more than just 20 seconds of execution.
Assume each context switch takes x seconds.
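A worked sketch of the metrics for this FCFS run (assuming one x-second context switch before each job, including the first):
A runs x..5+x, B runs 5+2x..7+2x, C runs 7+3x..16+3x, D runs 16+4x..20+4x
Wait times: A = x, B = 4+2x, C = 5+3x, D = 13+4x; average wait = (22+10x)/4 = 5.5 + 2.5x
Turnaround times: A = 5+x, B = 6+2x, C = 14+3x, D = 17+4x; average turnaround = (42+10x)/4 = 10.5 + 2.5x
Utilization = (useful time)/(total time) = 20/(20+4x)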