CS111 Scribe Notes
Lecture 10: 11/4/2013
Contributors: Tania De Pasquale, Stephen Hernandez, David Powell
Condition variables
What is a condition variable?
- A Boolean expression representing the condition (defining and checking it is the programmer's responsibility)
- A blocking mutex to protect the condition, the condition can only change when you own the mutex
- A condition variable of type condvar_t representing the condition
API:
wait(condvar_t *c, bmutex_t *b)
// Precondition: the caller has acquired b
// Action: releases b, then blocks until some other thread/process notifies that the condition may be true, then reacquires b and returns
// (b is the blocking mutex that protects the condition)
notify(condvar_t *c)
// Notifies one other thread waiting on c that the condition is true
// Broadcast/notifyAll notifies all threads currently waiting on c
// -- Advantage of broadcast: application-specific priorities (wake everyone and let the application decide who proceeds)
Example:
struct pipe {
  bmutex_t b;                   // protects r, w, buf
  condvar_t nonfull, nonempty;
  size_t r, w;
  char buf[N];
};
void writec(struct pipe *p, char c) {
  acquire(&p->b);           // blocking mutex: if we can't get the lock, block until we can
  while (p->w - p->r == N)  // while, not if: the condition may have changed again while we waited (race condition)
    wait(&p->nonfull, &p->b);
  p->buf[p->w++ % N] = c;
  release(&p->b);
  notify(&p->nonempty);
}
* wait / notify don't promise anything, they wake the process up so the process can check if it should do anything
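The writec above can be paired with a symmetric readc. Below is the whole pipe expressed with POSIX threads, a sketch in which `pthread_mutex_t`/`pthread_cond_t` stand in for the lecture's `bmutex_t`/`condvar_t` and N is an arbitrary buffer size:

```c
#include <pthread.h>

#define N 1024                            /* assumed buffer size */

struct pipe {
    pthread_mutex_t b;                    /* protects r, w, buf */
    pthread_cond_t nonfull, nonempty;
    size_t r, w;                          /* w - r = bytes currently buffered */
    char buf[N];
};

void writec(struct pipe *p, char c) {
    pthread_mutex_lock(&p->b);
    while (p->w - p->r == N)              /* full: wait for a reader to drain */
        pthread_cond_wait(&p->nonfull, &p->b);
    p->buf[p->w++ % N] = c;
    pthread_mutex_unlock(&p->b);
    pthread_cond_signal(&p->nonempty);    /* notify(): wake one blocked reader */
}

char readc(struct pipe *p) {
    pthread_mutex_lock(&p->b);
    while (p->w == p->r)                  /* empty: wait for a writer to fill */
        pthread_cond_wait(&p->nonempty, &p->b);
    char c = p->buf[p->r++ % N];
    pthread_mutex_unlock(&p->b);
    pthread_cond_signal(&p->nonfull);     /* a blocked writer may now proceed */
    return c;
}
```

Note that `pthread_cond_wait` has exactly the wait() contract above: it releases the mutex, blocks, and reacquires the mutex before returning.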
Semaphores
What is a semaphore?
- A blocking mutex is a binary semaphore (N = 1).
- A semaphore is like a blocking mutex, except it protects N instances of a resource instead of just 1
- e.g. create semaphore with N = 5
- Resources available if N > 0 and out of resources if N = 0
- Primitives:
- prolaag, P (like acquire: decrements N if N > 0, otherwise waits)
- verhoog, V (like release: increments N)
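P and V can be built from a blocking mutex and a condition variable. A minimal sketch with POSIX threads (not the POSIX `sem_t` API; `struct sema` and its field names are invented for illustration):

```c
#include <pthread.h>

/* A counting semaphore: n available instances of some resource. */
struct sema {
    pthread_mutex_t m;
    pthread_cond_t nonzero;
    int n;                               /* resources still available */
};

void P(struct sema *s) {                 /* prolaag: acquire one instance */
    pthread_mutex_lock(&s->m);
    while (s->n == 0)                    /* out of resources: block */
        pthread_cond_wait(&s->nonzero, &s->m);
    s->n--;
    pthread_mutex_unlock(&s->m);
}

void V(struct sema *s) {                 /* verhoog: release one instance */
    pthread_mutex_lock(&s->m);
    s->n++;
    pthread_mutex_unlock(&s->m);
    pthread_cond_signal(&s->nonzero);    /* wake one thread stuck in P */
}
```

With n initialized to 1 this behaves exactly like a blocking mutex, which is why a blocking mutex is called a binary semaphore.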
Hardware Lock Elision (HLE)
Released by Intel in June 2013 (in the Haswell processors):
- We want the effect of spin locks without the cycle delay
- Spin locks are pessimistic: a waiting thread does no useful work while it spins
- When many threads contend for the lock, there are large gaps of delay between acquiring the lock and doing the work
Example:
lock:
 try:
   xacquire lock xchgl %eax, lockvar // optimistic variant of xchgl: proceed under the assumption the lock will succeed
   cmp $0, %eax
   jnz try
unlock:
 xrelease movl $0, lockvar
// If it turns out we didn't really have the lock, the speculative operations vanish like an undo (expensive) and we try again
Deadlock
Sometimes deadlock can come about in unanticipated ways and is tricky.
say: we want a fast cat
e.g.: sort foo | cat | tr ...
We would want to skip the userspace buffer with a copy syscall:
while (copy(0, 1, 4096) > 0)
  continue;
// The copy syscall should speed it up, right?
// This introduces deadlock, however:
// copy(0, 1, ...) acquires fd 0 then fd 1; if someone else does copy(1, 0, ...), it acquires 1 then 0 = DEADLOCK!
Inside cat:
ssize_t r;
while ((r = read(0, buf, sizeof(buf))) > 0)
  write(1, buf, r);   // write only the r bytes actually read, not sizeof(buf)
exit(0);
// data path: OS -> userspace buffer -> OS; copy would skip the userspace hop
How to avoid deadlock:
- Don't lock [don't require exclusion]
- Don't allow >1 lock per process at a time
// If a process is holding a lock, then don't allow it to try to acquire another
- Lock ordering - "try acquire"
// Assume some lock ordering (in practice oftentimes the memory address of locks)
// Acquire locks in ascending order; if any lock is busy, then give up all locks and start over
- Dynamic detection of cycles (used in lab)
// If the new lock would create a cycle, reject it
// Actually very common [used by linux]
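The lock-ordering idea can be sketched in a few lines. Assuming POSIX mutexes and ordering by memory address (the in-practice convention the notes mention; `acquire_both`/`release_both` are hypothetical helper names):

```c
#include <pthread.h>

/* Always acquire two mutexes in ascending address order, so no two
 * threads can each hold one lock while waiting for the other. */
void acquire_both(pthread_mutex_t *a, pthread_mutex_t *b) {
    if (a > b) {                 /* sort the two locks by address */
        pthread_mutex_t *t = a;
        a = b;
        b = t;
    }
    pthread_mutex_lock(a);       /* everyone locks the lower address first... */
    pthread_mutex_lock(b);       /* ...so a cycle of waiters is impossible */
}

void release_both(pthread_mutex_t *a, pthread_mutex_t *b) {
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}
```

Two threads calling `acquire_both(&m1, &m2)` and `acquire_both(&m2, &m1)` end up locking in the same order, which is exactly what breaks the copy(0,1)/copy(1,0) deadlock above.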
File System Performance
Problem: the storage hierarchy (fast and small at the top, slow and big at the bottom):
  ALU               ~1 cycle (a few nanoseconds)
  Registers         similar to ALU
  L1 cache          ~2 cycles
  L2 cache          4-8 cycles
  DRAM              ~100 cycles
  Flash SSD
  Hard disk
  Mag tape library  ~5 minutes latency
* Ideally disk would be as fast as registers
* Why are the secondary storage devices (SSD and hard disk) so slow?
Disk:
7200 RPM = 120 rotations per second (120 Hz)
Read heads float over the spinning platter; if the data is all in the same track, the head never needs to move
To grab a random sector off disk:
- Accelerate read heads in direction of desired track
- Halfway there, decelerate
- "listen" for the correct track and settle down
- Wait for data to come under the head
- Copy data out of the sector and into the disk controller cache
- Transfer data from cache into CPU or RAM
// The timings for these steps are typically estimates based on previous disk implementations
Typical 1TB drive: (Seagate Barracuda ES2)
- 16 MiB cache, 7200 RPM
- rotation time: 8.333... ms
- average rotational latency: 4.1666... ms
- average read/write seek: 8.5/9.5 ms
//writes take longer to seek than reads: a read can start while the head is merely near the track, but a write must be settled precisely over the track's center
- track-to-track seek: 0.8/1.0 ms
- max internal xfer rate: 1.29 Gb/s // from disk to cache
- max sustained xfer rate: 0.93 Gb/s //rate at which it can transfer from disk to cache in long term (for grabbing all data)
- external xfer rate: 3 Gb/s //from cache into bus
- bytes/sector: 512/520/524/528 //with Linux just use 512
- typical: 12.5 W
- idle: 9 W
- MTBF (mean time between failures): 1.2 million hours
- AFR (annualized failure rate): 0.73% //if left running 24/7, there's a 0.73% chance it fails within a year
- Nonrecoverable sector read failure rate: 1 per 10^15 bits read
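A back-of-the-envelope check of what these numbers mean for a single random read (an estimate from the spec sheet above, not a measurement; `random_read_ms` is an invented helper name):

```c
/* Average time to read one random 512-byte sector, using the
 * Barracuda ES.2 figures above. */
double random_read_ms(void) {
    double seek = 8.5;                          /* average read seek, ms */
    double rotation = (60.0 / 7200) * 1000 / 2; /* half a rotation: ~4.17 ms */
    double xfer = 512 / (0.93e9 / 8) * 1000;    /* 512 B at 0.93 Gb/s: ~0.004 ms */
    return seek + rotation + xfer;              /* ~12.7 ms total */
}
```

The transfer term is three orders of magnitude smaller than the seek and rotation terms, which is why the batching and prefetching tricks below focus entirely on amortizing the mechanical overhead.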
SSD (Corsair Force GT):
- 60 GB
- 2.0 W max
- 1 million hours MTBF
- 280 MB/s (1.74 Gb/s) read
- 270 MB/s (1.7 Gb/s) write
* Flash wears out! Each block survives only so many writes.
* Issue: wear leveling: if a "hot" block is wearing out, use a different physical block in its place
for (;;) {
char c;
if (read (0, &c, 1) != 1)
break;
process(c);
}
// assuming the worst-case implementation (all six HDD steps above for every read):
// each 1-byte read will take a long time:
// avg 4.17 ms rotational latency + 8.5 ms seek + transfer time (1 byte at ~100 MB/s: too small to matter) ≈ 12.7 ms
Batching: take control of setup/wait and do once, then grab as much data as you can
- + do data in large batches so that the control overhead is less per byte
- + can improve performance by up to the batching factor
- - change your source code
Ex. read 1024 bytes at a time instead of 1; assuming processing time is insignificant and read is the bottleneck, roughly 1000x faster
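The batched version of the byte-at-a-time loop above might look like this (a sketch: `process` is the same per-character routine as before, given a trivial stand-in body here, and `process_input` is an invented wrapper that returns the byte count):

```c
#include <unistd.h>

void process(char c) { (void)c; }   /* stand-in for the per-character work */

/* One read() per 1024 bytes instead of one per byte, amortizing the
 * syscall + seek + rotational overhead across the whole batch. */
long process_input(void) {
    char buf[1024];
    ssize_t n;
    long total = 0;
    while ((n = read(0, buf, sizeof buf)) > 0) {
        for (ssize_t i = 0; i < n; i++)
            process(buf[i]);
        total += n;
    }
    return total;                   /* bytes processed */
}
```

Note the inner loop must run over n, the bytes actually read, since the final read of a file will usually return fewer than 1024 bytes.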
Prefetching:
- OS guesses future reads & saves its guess in buffer
- Separate buffer of most recent sector read, then read grabs from this buffer
- Correct guesses = same performance enhance as batching
- Overhead (8.5 ms seek + 4.17 ms rotational latency) can proceed in parallel with the application's computation
Dallying:
- + Kernel can wait for another write into the same/similar sectors before it actually writes
- + Sometimes this works especially if program will write again
- + Delaying work makes work more efficient
- - Problem: if the extra write never arrives, dallying just postponed the work and wasted time
- - Things can go wrong: if the OS shuts off or power fails, the delayed "writes" are lost
// dallying can also be a win for flash/USB drives: fewer writes means less wear
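A userspace sketch of dallying: postpone small writes by buffering them, and issue one large write() only when a full block has accumulated. (The kernel's buffer cache plays the same trick with disk writes; `buffered_putc`, `pending`, and `wflush` are invented names for illustration.)

```c
#include <unistd.h>

static char wbuf[4096];   /* assumed block size */
static size_t wlen;       /* bytes postponed so far */

void buffered_putc(char c) {
    wbuf[wlen++] = c;
    if (wlen == sizeof wbuf) {   /* dally until a full block accumulates */
        write(1, wbuf, wlen);
        wlen = 0;
    }
}

size_t pending(void) { return wlen; }

void wflush(void) {   /* the risk: a crash before this point loses the data */
    if (wlen) {
        write(1, wbuf, wlen);
        wlen = 0;
    }
}
```

This is essentially what stdio's putc/fflush do, and it shows both sides of the trade: 4096 one-byte calls become one syscall, but everything in wbuf is lost if the process dies before wflush runs.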