CS111: Scribes Notes for October 28, 2013

Scribe: Dennis Shen

Critical Sections

What is a Critical Section?

A section of code that must be run from beginning to end without being interrupted. Alternatively: a section of code in which at most one thread can be running at a time.

Our data structures are delicate and can't survive being stomped on by multiple threads or multiple simultaneous executions.

Why are interrupts bad for Critical Sections?

The interrupt handler might re-execute the critical section if it gets interrupted. But this definition is also too restrictive: consider two threads, each writing to a different pipe, both calling the write_c function at the same time. That violates the no-interruption definition of a critical section even though the threads touch entirely different data.

Our revised definition of a critical section: a section of code that must be run from beginning to end without being interrupted. Alternatively: a section of code in which at most one thread at a time can be modifying a given instance of data.

This ensures that concurrent operations on unrelated data can still proceed while preserving the idea of critical sections.
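For instance, under the revised definition these two calls touch different instances of data (different pipes), so they may run at the same time (a sketch; the pipe variable names are assumptions):

write_c(pipe1, 'a');	// thread 1
write_c(pipe2, 'b');	// thread 2: a different pipe, so no conflict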

Let us assume:

1. Single threaded

2. No signals

3. No multitasking - no clock interrupts

How to identify a Critical Section:

Big block of code: 1,000,000 lines of code

Method #1: (have one lock and one unlock for all the code)

lock();
//block of code 1 million lines long
...
...
unlock();

However, if critical sections are too large, the result is bad performance and a loss of parallelism.

Method #2: (take every pair of instructions and wrap it in a lock/unlock pair)


lock()
//2 lines of code
unlock()

lock()
//2 lines of code
unlock()

Problem: the overhead of all the lock/unlock calls hurts performance, and splitting related operations apart introduces race conditions.
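For instance (a sketch using shared variables like those in the example below), splitting a dependent read and the write that uses it into separate critical sections lets another thread slip in between:

lock();
int j = z + 1;	// dependent read of shared z
unlock();
// another thread can modify z or y right here
lock();
y = j;	// j may now be stale
unlock();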

Method #3:

Goldilocks: critical sections that are just the right size.

How to find the right size for a Critical Section?

1. Look for shared data, specifically shared writes: assignments, modifications, etc.

2. Worry about writing and reading at the same time (look for dependent reads).

3. Find the minimal critical section that contains all of this.


Example:
// x, y, z are shared variables; DR marks a dependent read, W a write
DR	int j = z + 1;
	nananan();	// unrelated to x, y, z: move it out of the critical section to improve performance
W	y = j;
W	x = j + y;
// the dependent read and the two writes form the critical section
// y is a problem: it is modified and then used in another write
// z can also be modified by another thread between the read and the writes
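A minimal sketch of where the lock/unlock pair would go (assuming a lock l that guards x, y, and z):

nananan();	// hoisted out: it touches none of the shared data
lock(&l);
int j = z + 1;	// dependent read of z
y = j;	// the shared writes stay inside
x = j + y;
unlock(&l);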

How to enforce Critical Sections

How do we implement lock and unlock, and how do we maintain read/write coherence?

For example

[CPU]      [CPU]
  |          |
[CACHE]   [CACHE]
  |          |
------------------------------------ BUS
        |
      [RAM]

We have two CPUs, each connected to its own cache, all connected to a bus. What if one CPU does a store while the other loads from the same location in memory? The first CPU will store into its cache and only later into RAM, so the second CPU will load from RAM and get the "wrong" value. There is a race condition that has to be dealt with!

What are the possible solutions?

1. No cache coherence (push the problem onto software)

2. No caches at all

3. Snooping: whenever you do a store, immediately write it through to memory; the other caches snoop the bus and update their copies. This is write-through caching: writes go to RAM directly.

We have to assume that there is some support at the hardware level.

For the x86:

typedef struct {
	int cpu_id;
	int counter;	// how many times has this lock been used
	// other stuff
} lock_t;

void lock(lock_t *l) {
	l->cpu_id = self_cpuid;	// our CPU's id
	l->counter = 0;
	// Even if each of these stores is atomic, lock() as a whole isn't:
	// someone could look inside between the two instructions and see a
	// partially done lock.
}
  	

Obviously this lock won't suffice, as the function as a whole is not atomic.

typedef long long int lock_t;	// 8-byte integer
// With only 4-byte alignment it could straddle the boundary between two
// cache lines: the CPU writes 4 bytes, then the other 4 bytes, and
// someone could read in between and see half a lock.

void lock(lock_t *l) {
	*l = self_cpuid;
}

typedef char lock_t;

void lock(lock_t *l)
{
	*l = 1;
}

New problem: not cache lines this time. To store a single byte, the processor might load the containing word from RAM, replace the affected byte with 1, and store the whole word back into RAM. The problem is that someone else could be locking a different lock within the same word, and one CPU's word store can silently overwrite the other's byte.
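A sketch of how two byte-sized locks can collide (the struct layout is an assumption):

struct {
	char lock_a;	// byte 0
	char lock_b;	// byte 1: same 4-byte word as lock_a
} locks;

// CPU 1 stores locks.lock_a = 1 as: load word, set byte 0, store word.
// CPU 2 stores locks.lock_b = 1 the same way at the same time.
// Whichever word store lands last silently undoes the other CPU's lock.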

Objects that are too small are not coherent; objects that are too large are not coherent either.

Our only solution is to consult the hardware manual. For the x86, aligned 4-byte words are the natural coherence size: 4-byte loads and stores of aligned words behave coherently.


typedef int lock_t;

void lock(lock_t *l)
{
	while (*l)	// race condition possibility
		continue;
	*l = 1;
}

void unlock(lock_t *l) {
	*l = 0;
}
// Two CPUs can both see *l == 0, both store 1, and both actually
// think they got the lock.
// The x86 offers: lock incl x, where x is a memory location

Still we run into problems. lock incl x is an instruction that atomically increments a cell in memory. It is expensive: lock incl has to broadcast to every CPU that it is doing this increment, and every other CPU must keep its hands off the cell. However, this does not solve our problem, because the checking is a separate step: there is still a race condition between the increment and the check.
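Even with a hypothetical atomic_incr() wrapper around lock incl, the increment-then-check pair as a whole is still not atomic (a sketch):

atomic_incr(&x);	// hypothetical wrapper around "lock incl x"
if (x == 1) {	// another CPU can run its atomic_incr between our
		// increment and this test, so this check can fail even
		// for the CPU that incremented first
	// ... did we really get the lock? ...
}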

xchg r,x
This atomically exchanges the contents of register r with the memory location x.

By using xchg we can test and set a value in memory atomically.

static inline int xchg(int *p, int val)
{
	// behaves as if:
	//	int v = *p;
	//	*p = val;
	//	return v;
	// but the hardware does the whole exchange atomically
	asm volatile ("xchgl %0, %1" : "+r" (val), "+m" (*p));
	return val;
}

void lock(lock_t *l)
{
	while (xchg(l, 1))	// spin lock: costs a lot of CPU time
		continue;
}
// Short critical sections (< 10 instructions) justify the cost.
// Unlocking is no problem: everyone else is running xchg, so a plain
// store of 0 suffices.

Congratulations, we have created a spin lock: it spins on and on until xchg returns 0, and then the caller has the lock.

Build higher-level synchronization out of lower-level primitives.


int xpf(int);	// expensive function; pure function: looks only at its args, not other storage

x = xpf(x);	// if x is shared, this won't work: it creates a race condition

lock_t l;

lock(&l);
x = xpf(x);	// xpf takes too long: this violates the Goldilocks principle
unlock(&l);
// correct, but bad performance

int ans = xpf(x);	// reading x here is another race condition
lock(&l);
x = ans;
unlock(&l);
// good performance, but incorrect

Compare and swap

int compare_and_swap(int *p, int old, int new)
{
	int oldval = *p;
	if (oldval == old) {
		*p = new;
		return 1;
	}
	return 0;
}
// assume this entire function executes atomically

// here's how to use this function:

int old;
int ans;
do {
	old = x;
	ans = xpf(old);	// the expensive work happens outside any lock
} while (!compare_and_swap(&x, old, ans));
// The only critical section is inside compare_and_swap itself:
// no locks are needed.

There is a compare-and-swap instruction on the x86 (lock cmpxchg). There are no locks in this implementation, but the object must fit in 32 bits.
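In practice, one way to get an atomic compare-and-swap in C without writing assembly is a compiler builtin; a minimal sketch, assuming GCC:

// GCC builtin that compiles down to lock cmpxchg on the x86;
// returns nonzero if the swap happened
int compare_and_swap(int *p, int old, int new)
{
	return __sync_bool_compare_and_swap(p, old, new);
}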

We want to scale to objects bigger than an int.

We can embed a lock_t into the larger object in order to lock that object.

Every object gets its own lock.
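For example, a pipe might embed its own lock (a sketch; the field names r, w, buf and the buffer size are assumptions):

#define PIPE_SIZE 512	// assumed buffer size

struct pipe {
	lock_t l;	// this pipe's own lock
	unsigned r, w;	// read and write offsets
	char buf[PIPE_SIZE];
};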


int readc(struct pipe *p, ...) {
	lock(&p->l);
	// ... read a byte from the pipe ...
	unlock(&p->l);
}
// Still a performance problem: with a number of threads all wanting to
// use the same pipe, every thread busy-waits on the lock except the one
// that owns it. As the number of threads grows, so does the busy-waiting.

How do we solve this problem?

Blocking Mutexes

Use a different synchronization primitive called a blocking mutex (or just a mutex). A mutex doesn't spin when it can't get the lock. When you want the lock and someone else has it, you block: you let someone else run and tell the OS, "please don't schedule me until the blocking mutex is released."


int readc(struct pipe *p, ...) {
	acquire(&p->m);
	// ... read a byte from the pipe ...
	release(&p->m);	// unlock the mutex and notify whoever is waiting for it
}

typedef struct {
	lock_t l;	// spin lock protecting the mutex's own fields
	bool acquired;
	// set of processes waiting for the mutex to be released
	proc_t *blocked_list;	// assume proc_t has a next field, making this a linked list
} bmutex_t;
// Lots of threads examine a mutex at once, so the mutex itself needs a
// lock to avoid race conditions.

void acquire(bmutex_t *m) {
	for (;;) {
		lock(&m->l);
		if (!m->acquired) {
			m->acquired = 1;	// got it
			unlock(&m->l);
			return;
		}
		// append self (thread id) to m->blocked_list, a linked list
		self->runnable = 0;	// tell the OS not to schedule us
		unlock(&m->l);
		yield();	// tell the scheduler to run somebody else
	}
}
// release: set acquired to 0, check if anyone else has been trying to get this mutex, and wake them up
void release(bmutex_t *m)
{
	lock(&m->l);
	m->acquired = 0;
	while (m->blocked_list) {
		m->blocked_list->runnable = 1;	// wake up a waiter
		m->blocked_list = m->blocked_list->next;	// and remove it from the list
	}
	// design question: should release notify one waiting process, or all of them?
	unlock(&m->l);
}

There's still a performance problem:


int readc(struct pipe *p) {
	for (;;) {
		acquire(&p->m);
		if (p->r != p->w)	// pipe is non-empty: we can read
			break;
		release(&p->m);	// empty pipe: release and try again
	}
	// ... real work: read a byte from the pipe ...
	release(&p->m);
}
// There's still a CPU problem: picture 100 threads trying to read an
// empty pipe. Each one wakes up, acquires the lock, sees the pipe is
// empty, releases the lock, and repeats, eating up CPU time.

Build another primitive on top of blocking mutexes - condition variables.

Condition variables

waitfor(p->r != p->w) is what we want: a function that waits until the condition is true, which we can't express directly in C.

The application decides which conditions it cares about and declares its own condition variables.

For example, for a pipe some possible conditions are:

pipe_empty

pipe_full

These are the two conditions that we care about for a pipe.

How can we use a condition variable?
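A sketch of the usual answer (the names condvar_t and wait are assumptions; the key property is that wait atomically releases the mutex and re-acquires it after being notified):

condvar_t nonempty;	// assumed: notified when the pipe becomes non-empty

int readc(struct pipe *p) {
	acquire(&p->m);
	while (p->r == p->w)	// empty pipe
		wait(&nonempty, &p->m);	// sleep; atomically releases p->m
	// ... real work: read a byte from the pipe ...
	release(&p->m);
}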