4/26/2007
by Richard Fillman and Jason Stoops
Imagine actual threads (of execution) walking through memory executing instructions
Threads in totally different parts of memory are fine.
The problems arise at the points of intersection.
Sequence coordination:
Each thread is written as a sequential program: it does each action in order.
Natural consequence of threads being sequential programs.
The next action can assume that the previous action's work is done, and done properly.
Isolation
We want threads to not collide with each other.
When threads "talk" to different parts of memory, isolation is easy.
Collision points are the hard parts.
Two thread actions x & y should never interfere with each other.
Implement isolation using: atomicity
Enforces isolation
An action is either done to completion or not done at all.
From the point of view of an outside observer: never an "in between" state where an action is half done
Read/write coherence
Atomicity being applied to loads and stores.
Read/write coherence occurs when each load and store is either done or not done
No worries about half-loaded and half-stored stuff.
char buf[100];
strcpy(buf, "abcdefghi"); //suppose here, another thread examines buf. It sees "abcdefghi"
strcpy(buf, "wxyzqrstuv");
Is it possible for a thread to run right in the middle of our strcpy such that it sees neither the old nor the new value?
It depends on how strcpy is implemented.
Normally strcpy looks like this:
char *strcpy(char *a, char const *b) {
    char *p = a;
    while (*p++ = *b++)
        continue;
    return a;
}
strcpy is not an atomic operation, it doesn't give us isolation.
x86 has special string-move instructions (the VAX called its version movc3, "move characters").
So we could write something like (pseudo-assembly):
movc3 buf, "abcdefghi"
But even still, this is not atomic.
Underneath, a little micro-engine does the individual loads and stores.
Some other thread on some other CPU could see some strange value in the meantime.
Even on a single CPU machine, these kinds of instructions are still interruptible.
The problem here is that large objects like these simply aren't atomic; you just have to live with it.
Sometimes small objects are not atomic either.
struct s {
    unsigned int a:1;
    unsigned int b:1;
    char buf[100];
};
This is laid out in memory roughly as follows: one byte contains a and b (plus 6 unused bits), followed by the 100 bytes of buf.
So suppose we do this:
struct s v;
v.b = 1;
Is this atomic? NO!
Why?
The machine loads the first word (containing a, b, and the first 3 bytes of buf), sets the bit for b, and stores the whole word back into memory.
Between the load and the store, some other thread might set v.a = 1, and one thread is going to stomp the other's data.
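A sketch of what the compiler effectively does for v.b = 1 (illustrative only; the exact word size and bit position depend on the ABI):
// Read-modify-write, so not atomic:
uint32_t w = *(uint32_t *)&v;   // load the word containing a, b, and part of buf
w |= 1 << 1;                    // set the bit used for b (bit position illustrative)
*(uint32_t *)&v = w;            // store the whole word back
// Another thread's v.a = 1 between the load and the store gets lost.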
What is atomic then?
Look at ABI for your architecture
Assume nothing is atomic unless ABI says otherwise.
x86
Load/store of a 32-bit aligned word
Test and set function
Lock increment long (lock incl x) IS atomic. (Normal increment long is NOT atomic)
(The non-locking version is the default for speed reasons.)
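A sketch of how to get that instruction from C, assuming the GCC/Clang inline-assembly extension and an x86 target:
#include <stdint.h>
// Atomic increment of an aligned 32-bit word via x86 "lock incl"
// (a sketch; assumes the GCC/Clang asm extension).
static inline void atomic_incl(uint32_t *x) {
    asm volatile ("lock incl %0" : "+m" (*x) : : "memory");
}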
In C, compare-and-swap behaves like this:
bool compare_and_swap(uint32_t *p, uint32_t old, uint32_t new) {
    // store the new value into memory if *p == old value
    // this is done by a single instruction and is atomic
    if (*p == old) {
        *p = new;
        return true;
    } else
        return false;
}
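In practice a compiler builtin supplies this atomically; a sketch, assuming GCC or Clang (whose __sync builtins provide it):
#include <stdbool.h>
#include <stdint.h>
// Same interface, backed by the GCC/Clang builtin, which compiles to an
// atomic instruction (lock cmpxchg on x86).
bool compare_and_swap(uint32_t *p, uint32_t old, uint32_t new) {
    return __sync_bool_compare_and_swap(p, old, new);
}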
How to implement lock increment long using compare_and_swap()?
compare_and_swap(&x, x, x+1);
This won't work because the 2nd and 3rd arguments may have been loaded at different times
Let’s try again:
uint32_t x0=x;
compare_and_swap(&x,x0,x0+1);
This way we load x only once, but x could still change in the meantime.
compare_and_swap will return false if that happens
RULE OF OS PROGRAMMING: never call a function that CAN return an error without CHECKING for errors.
So here we go:
uint32_t x0;
do {
    x0 = x;
} while (!compare_and_swap(&x, x0, x0 + 1));
Imagine we're a bank
We want to eliminate race conditions on deposits and withdrawals
Can’t lose money for the bank or customers.
Tons of threads, people using ATMs, hitting the exact same account.
typedef struct {
    uint64_t balance; // balance in pennies
} ba;
// Need 64 bits: 32 bits tops out around 4 billion pennies = $40 million,
// and that's not unreasonable...
void deposit(ba *p, uint64_t amt) {
    p->balance += amt;
}
void withdraw(ba *p, uint64_t amt) {
    if (amt <= p->balance)
        p->balance -= amt;
}
This is all well and good until we try to use it on a multithreaded server; then we have problems.
Start with $1.00
Person A              Person B
deposit(10);          deposit(20);
load 100              load 100        // uh oh, need to load 110...
add to get 110        add to get 120
store 110              store 120
Solution: lock!
Coarse grained lock: protects all deposits and withdrawals, everyone waits on that lock to do anything
Fine grained lock: lock per account.
Rewrite to include mutex:
typedef struct {
    uint64_t balance; // balance in pennies
    mutex_t m;
} ba;
void deposit(ba *p, uint64_t amt) {
    lock(&p->m);
    p->balance += amt;
    unlock(&p->m);
}
void withdraw(ba *p, uint64_t amt) {
    lock(&p->m);
    if (amt <= p->balance)
        p->balance -= amt;
    unlock(&p->m);
}
Rule for placing critical sections (for now)
Look for writers to shared objects
Look for dependent reads before writing.
Generate the data to be written.
Put the write and dependent reads into the critical section.
This assumes that a critical section always consists of several reads and one write.
If multiple writes, then merge critical sections.
If other instructions in the middle of the reads and writes are unrelated, get them out of there (see the sketch after this list).
You want critical sections to be as lean and mean as possible.
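For instance (a sketch; audit_log is a hypothetical helper that never touches the shared balance):
// Applying the rule to withdraw(): only the dependent read (the balance
// check) and the write it guards stay inside the critical section.
// audit_log() is hypothetical and unrelated, so it stays outside the lock.
void withdraw(ba *p, uint64_t amt) {
    audit_log("withdraw requested", amt);   // unrelated work: outside
    lock(&p->m);
    if (amt <= p->balance)                  // dependent read
        p->balance -= amt;                  // the write it guards
    unlock(&p->m);
    audit_log("withdraw finished", amt);    // unrelated work: outside
}
Reads of shared data need the same care; consider this unlocked read: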
uint64_t get_balance(ba *p) {
return p->balance;
}
This could get us into trouble, since another thread might be halfway through a 64-bit store.
(Assuming for purpose of argument that 64 bit loads are two 32 bit loads)
If we're caught in the middle of this action, we'll get some bogus number that is neither the old nor new value.
(This only applies to those of us with more than 40 million dollars but that's beside the point)
uint64_t get_balance(ba *p) {
    lock(&p->m);
    return p->balance;
    unlock(&p->m); // this is REALLY bad because we never unlock!
}
rewrite to fix unlocking problem:
uint64_t get_balance(ba *p) {
    lock(&p->m);
    uint64_t v = p->balance;
    unlock(&p->m);
    return v;
}
Problem: performance bottleneck
This isn't a problem when we only have readers: many simultaneous readers are fine.
One writer with no readers is also fine.
Solution: read/write lock
Allow either one writer or however many readers you want.
typedef struct {
    mutex_t wm;           // write lock
    unsigned int readers; // number of readers
} rwlock_t;
void lock_for_read(rwlock_t *p) {
    lock(&p->wm);    // exclude writers and other updates to the rwlock_t itself
    p->readers++;    // increment reader count (shared variable, so a critical section)
    unlock(&p->wm);  // unlock so other readers can increment and get the lock too
}
void unlock_for_read(rwlock_t *p) {
    lock(&p->wm);
    p->readers--;
    unlock(&p->wm);
}
void lock_for_write(rwlock_t *p) {
    for (;;) {
        lock(&p->wm);      // here we know there are no other WRITERS
        if (!p->readers)   // ...but there may still be readers
            return;        // no readers left: return still holding wm
        unlock(&p->wm);    // let the remaining readers drain, then retry
    }
}
void unlock_for_write(rwlock_t *p){
/* left as exercise for the reader */
}
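With that in place, readers no longer exclude each other for the duration of the read; a sketch, assuming ba is given a hypothetical rwlock_t field named rw:
// Sketch: many readers can hold the read lock at once.
// Assumes ba has a (hypothetical) rwlock_t field named rw.
uint64_t get_balance(ba *p) {
    lock_for_read(&p->rw);
    uint64_t v = p->balance;
    unlock_for_read(&p->rw);
    return v;
}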
Assume extreme politeness and accuracy in locking and unlocking.
Lock precondition: the caller does not already hold the lock.
Unlock precondition: the caller holds the lock.
Go back to day 1, concept of a system, interface, and external world
Things that are observable are things that go through the interface.
Atomicity doesn't refer to atomic actions inside the system
We care about observable atomicity; this is how you "measure" atomicity.
Behavior of a system is correct if all observations are consistent with some sequence of atomic operations.
The idea is as follows:
Initial threads (B is balance, D is deposit, W is withdrawal)
T1 D D W B
T2 W D D
T3 B B W
Balance is the only observable function
We have 10 transactions overall.
3 observations are in there somewhere (the balance calls)
All we care about are that the observations are consistent with some ordering of the transactions
We don’t care what particular order, however.
Furthermore, we don't care what's in the other states, as long as the observations are right.
Important consequence of serializability - internal state need not be obvious or "consistent"
From an OS point of view, observations are system calls.
Internal messes may even be faster, as long as nobody needs to look at them...
BENEVOLENT SIDE EFFECTS are allowed in implementations of atomic operations.
An effect on the state of the system that you like because it improves performance and the user doesn't notice.
Example: cache bank account.
Different sequences of events may leave cache in different states, but the user doesn't see the cache so it’s okay.
Can compare_and_swap implement deposit(p,amt) without a mutex?
Yes! If you have a compare_and_swap of the right size (64 bits in this case)
This is good, leads to performance improvements (mutexes slow things down)...
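A minimal sketch of that, assuming a 64-bit version of the primitive (hypothetically named compare_and_swap64) with the same shape as the 32-bit one above:
// Lock-free deposit: retry until the balance hasn't changed between our
// load and our CAS. compare_and_swap64 is a hypothetical 64-bit analogue
// of compare_and_swap.
void deposit(ba *p, uint64_t amt) {
    uint64_t old;
    do {
        old = p->balance;  // snapshot the current balance
    } while (!compare_and_swap64(&p->balance, old, old + amt));
}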
The scheduling problem:
You have N CPUs and >N threads that are runnable
Which threads should you run?
Other resources to schedule
Memory
Network
IO access
Question of what to schedule depends on what your scarce resource of the moment is.
Policies - rules of thumb you use to hand out resources.
Mechanisms - the code that does that work.
Example:
Airline scheduling!
Resources: planes, crew, passengers, fuel
OS scheduling is easy by comparison.
We have more CPU time than we really know what to do with; disks and network connections are generally idle.
BUT our scheduler has to be way faster.
Scheduling metrics: (how do we know how well we're scheduling?)
Throughput or utilization: (useful time) / (total time)
But think about this:
You’re crunching some image while typing your thesis
High utilization can be achieved just by devoting 100% of time to image crunching
But that's inconvenient when you're trying to work on your thesis and things don't respond.
So, this leads us to another metric: turnaround time
Turnaround time = (time of completion) - (time of request)
Response time = (time user starts to see results) - (time of request)
Waiting time = (time the action starts) - (time of request)
Goal 1 - maximize utilization
Goal 2 - minimize turnaround/response/waiting time (depending on your point of view)
Goal 3 - "fairness" - each competing task gets a "fair" share of CPU
A sample scheduling policy
First come first serve (FCFS or FIFO)
Job   Arrival Time   Workload
A     0              5
B     1              2
C     2              9
D     3              4
Timeline |--A--||B||----C----||-D-|
In between jobs we have context switches, however, so the whole run takes more than just 20 seconds of execution.
Assume each context switch takes x seconds.
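A worked sketch of the metrics for this FCFS run (assuming one x-second context switch before each job, including the first):
A runs x..5+x, B runs 5+2x..7+2x, C runs 7+3x..16+3x, D runs 16+4x..20+4x
Wait times: A = x, B = 4+2x, C = 5+3x, D = 13+4x; average wait = (22+10x)/4 = 5.5 + 2.5x
Turnaround times: A = 5+x, B = 6+2x, C = 14+3x, D = 17+4x; average turnaround = (42+10x)/4 = 10.5 + 2.5x
Utilization = (useful time)/(total time) = 20/(20+4x)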