CS111 Lecture 10: File System Performance

October 30, 2008 (Week 5, Thursday)

By: Brian Nguyen, Alex Shmurakov, Glen Lenker

Review: Simple (Polling) Mutexes vs. Blocking Mutexes

A mutex or lock is simply an object used by multiple threads to keep track of whether a resource can be accessed or not. This allows us to guarantee mutual exclusion when different threads attempt to access the same resource. Mutexes come in two flavors:

Simple Mutexes: This mutex is implemented by performing test-and-set in a loop on a single word. Suitable for brief accesses where collisions are unlikely.
Blocking Mutexes: This mutex is implemented by keeping a list of threads waiting for a resource and blocking each thread until the resource becomes available. Suitable for any type of access, including longer ones where collisions may be frequent.

Using Blocking Mutexes Improperly

Let's attempt to use a blocking mutex in our pipe implementation. More specifically, when writing characters to a pipe, we want to lock the pipe to make sure that nobody else can write or read while we're filling the buffer with data, and we want to use a blocking mutex to lower the overhead of locking.


// Here is our pipe struct, with a blocking mutex
struct pipe {
	bmutex_t b;
	char buf[N];
	size_t r,w;
}

// Before writing to the pipe, we enter a loop and acquire a lock on the
// pipe. If the pipe is not full (p>w - p>r != N) we break from the
// loop; otherwise, we unlock and keep looping until this condition becomes
// true.
void writec(struct pipe *p, char c) {

	// We loop until the pipe is not full.
	for(;;) {
		acquire(&p->b); // Here, the process might block!
		if(p->w - p->r != N)
			break;
		release(&p->b);
	}

	// Now that we have a lock and the pipe is not full, we copy a character
	// into the buffer.
	p->buf[p->w++%N] = c;

	// Finally, we release the lock.
	release(&p->b);

}

What's wrong with our current implementation?

We're still polling the pipe for a condition even though we have a blocking mutex! But we're using a blocking mutex precisely because we want to avoid polling.

How can we fix this?

We'd ideally want a variable that is only true when the condition p->w - p->r != N is true.

In functional languages, this might be expressed as (lambda() p->w - p-> r != N).

We'd also want some sort of "ideal system call" that would put the current process to sleep until this variable becomes true.

For example, one nice system call might be waitfor("p->w - p->r != N").

Unfortunately, without hardware support these features are impossible to implement. However, we can come up with a good approximation...

Condition Variables

While there's no way we can have a variable that becomes magically true when a certain condition is satisfied, we can still have some variable that is set to true when we satisfy this condition. This forms the basis of using condition variables:

Condition Variable: Variable that is a bool at machine level with API surrounding it

A condition variable stands for a boolean expression of interest. We want a condition variable to have the following properties:

The variable is true when the boolean expression is true.
The variable is false when the boolean expression is false.

Sounds like exactly what we need. But what's the catch? It's the caller's responsibility to guarantee that this intended property actually works on the machine level. That is, when the boolean expression changes, we need to manually set the condition variable to true or false.

Requirements for a condition variable

In order for a condition variable to work correctly, we need:

The boolean expression defining the variable,
A blocking mutex protecting this expression,
An actual condition variable representing it.

The API surrounding a condition variable

What should the API surrounding our condition variable look like? We need a function that blocks the thread while the condition is false, and a function that wakes up waiting processes when the condition becomes true.


wait(contvar_t *c, bmutex_t *b);
// PRECONDITION: We have acquired b.
// ACTION: 
	// Releases b, then blocks until some other thread notifies us 
	// that condition MIGHT have changed. <-- Race condition occurs here.
	// Reacquires b
	// Returns

notify(condvar_t *c);
// ACTION: Notifies one waiting thread that condition MIGHT have become true.

broadcast(condvar_t *c); (OR notify_all(condvar_t *c);)
// ACTION: Notifies ALL threads that condition MIGHT have become true.

Condition variables are easy in Java, as these functions are already implemented in the Condition class (use lock.newCondition() to declare one!).

Using Blocking Mutexes Properly

Now that we know better, let's rewrite write!


struct pipe {
    // New conditional variables to use in conjunction with blocking mutex
    condvar_t nonfull;
    condvar_t nonempty;
    bmutex_t b;
    char buf[N];
    size_t r,w;
}

void writec(struct pipe *p, char c) {
    for(;;) {
	    acquire(&p->b);
	    // This releases b, waits until notify() is called on it, 
	    // then reacquires b
	    wait(&p->b, &p->nonfull);
	    if(p->w - p->r != N)
		    break;
	    release(&p->b);
    }
    p->buf[p->w++%N] =c;
    // We now need to notify the read end that there's data in the buffer!
    notify(&p->nonempty);
    release(&p->b);
}

Some important observations to keep in mind:

If we placed notify() AFTER release() we would get a performance bug (not a fatal one) since we would wake up people we shouldn't. (This is due to a race condition: a different reader might acquire and read the data before we notify our intended reader, and as a result, the reader that we do eventually notify will need to go back to sleep and wait.)
We still need to have a for(;;) loop around the wait() call because of the fact that returning from this call indicates that there MIGHT be a nonfull buffer--it's a notice, not a guarantee.

Deadlocks

Even with perfect mutual exclusion, we can still run into some very hairy problems when two threads or processes need to share the same resource. In fact, these problems arise as a result of using mutual exclusion, and are known as deadlocks:

Deadlock: A situation wherein two or more competing actions are waiting for the other to finish, and thus neither ever does. (From Wikipedia)

Where might a deadlock occur?

Deadlock can occur in any number of places where multiple threads need access to the same resource. For example, consider this implementation of cat:


... | cat | ...

Here, this implementation uses one read(0,...); call to read from standard input and uses one write(1,...); call to write to standard output. Is there any way to combine the two system calls into one to improve performance? The newly combined system call might look like xfer(0,1,nbytes);

We need to lock the two data structures in order to initiate transfer, which introduces the problem of deadlock. For example:


PROCESS 1: xfer(0,1,nbytes); // Grab a lock on 0, then grab a lock on 1
PROCESS 2: xfer(1,0,nbytes); // Grab a lock on 1, then grab a lock on 0
	// ......deadlock!

Preconditions for deadlock

Deadlock comes up in overly complicated systems, and is typically a race condition. There are FOUR PRECONDITIONS for deadlock:

Mutual Exclusion: Either one, or the other process can have a resource, but not both.
Hold and Wait: One thread or process can hold object 1 and wait for object 2.
No Preemption: One process can't simply steal a lock from another process if the other process is taking too long.
Circular Wait: You need a cycle of dependencies.

Options for dealing with deadlock

Option A: Build a graph

How do we detect circular wait? Use a Resource Allocation Graph (RAG), also known as a Dependency Graph.

File Lock Graph

Any cycle on the graph corresponds to a cycle of dependencies. Assuming the other three conditions are satisfied, this creates deadlock.

Circles stand for threads or processes; squares stand for resources or locks.
An arrow from a thread to a lock indicates a lock request.
An arrow from a lock to a thread indicates a lock is active.

Thus, one solution to preventing deadlocks is to manage a dynamic RAG of currently running threads and active resources, and refuse lock requests that would create a cycle in the RAG.

Advantage: It works! It can actually be determined in order n where n is the number of processes and resources active.
Disadvantage: It's tricky; threads need to be able to deal with -EDEADLK.

Option B: Lock ordering

In an ordered-lock system, every process must obtain the locks it needs in some arbitrary numeric order. If for some reason a process can't obtain all the locks at once (because some lock is taken by a different process), it must give up all locks and try again later.

Advantage: Assigning numbers to locks is much easier to implement than keeping a dynamically updated RAG.
Disadvantage: This system relies on threads grabbing locks in the right order. This might be a big limitation on what threads can do, and it might be a rather arbitrary one as well, especially if there is no clear hierarchy of resources.

Option C: Manually managing locks

In this case, we implement ONLY the correct pattern of locking in our threads. If you know the pattern will never generate a cycle, then we don't have to worry about checking for deadlock.

For example, we might want to manually manage locking of pipes between a parent and a child process:


	parent
	|   ^
	V   |
	child

To avoid deadlocks, we must guarantee that the parent process reads only after writing all data, and the child process writes only after reading all data.

In general, this approach does NOT work since finding a pattern that never deadlocks is much harder than simply detecting deadlock when it occurs.

Other options

We could preempt processes that hold locks if they take too long to complete, but this isn't a particularly nice option. In some sense it defeats the purpose of having locks in the first place, since we now can no longer guarantee a particular process that it has full control of a resource after grabbing a lock on it, and we might run into nasty race conditions.

We could also break hold-and-wait by simply denying requests made by a single process for multiple locks. While this still allows us to guarantee to a process holding a lock that it has exclusive control of the associated resource, it severely limits what threads can do since no thread can have more than one resource open at a time.

Priority Inversion

This is a non-obvious bug that arises from two seemingly unrelated features: locks and priority scheduling.

Assume we have a blocking mutex and three threads running: one with low priority, one with high priority, and one with medium priority. They each need to acquire the blocking mutex at some point.

T_low	T_med	T_high
Runnable	Waiting	Waiting
acquire(&m); (some short block of code)
------- context switch ------>
Waiting	Waiting	Runnable acquire(&m)
	<-- context switch --
Waiting	Runnable (some long block of code)	Waiting

A low priority process might grab a lock and execute a little bit of code before a context switch is made by the scheduler to a high priority process in need of attention. This process requires the same lock as the low priority process, so it sleeps until the low priority job releases the lock. Then a context switch is made to a medium-priority process, which takes a long time to perform some menial task.

Why is this bad? A medium priority thread is running while a high priority thread wants to run! There will never be a context switch from the medium priority thread to the low priority thread that holds the lock, so the high priority thread hangs.

One high-profile example of this is a system failure of the Mars Pathfinder in 1997. The low-priority task in this case was an atmospheric measurement suite which acquired a lock on a shared resource, also needed by a high-priority task which managed information between different components of the robot.

Solution: Priority borrowing (priority inheritance)

If a low-priority thread has a lock needed by the high-priority thread, be sure to temporarily increase the locking thread's priority so it has a chance to complete its code even in the case of a context switch.

File System Performance

Comparison of disk speed to CPU speed

Back in the 1960s, CPUs were very, very slow! One of the fastest computers of the time, the Cray Supercomputer, had a speed of about 80 MFlops. There wasn't a need to worry about disk scheduling as much: the disk was fast enough for the CPU at the time.

Over the next few decades, however, the CPU speed has gone up astronomically, following Moore's law quite reliably; on the other hand, disk speed has stayed relatively constant. As a result, disk scheduling now is a much more important problem than it was back then.

Why are disks so slow?

Diagram of a typical hard drive

First, let's go over some useful terminology:

Platter: Circular disk that stores the magnetic data of a hard drive. Usually a hard drive contains about 3-5 platters.
Track: A cocentric circle on the surface of a platter that contains a sequence of magnetic data.
Cylinder: A set of tracks on different platters of the hard drive, all of the same diameter.
Read/Write Head: A mechanism used to read data from a track, sending the resulting signals on a bus to the disk controller, and also write data sent from the disk controller onto disk. Read/write heads are typically aligned to work along a common cylinder; that is, every head is the same distance from the center of the disk. To avoid damaging the magnetic media, read/write heads float along a thin film of air formed while the disk is spinning.
Disk Controller: A device which allows a hard drive to communicate with the main bus, interacting with the CPU or, if direct memory access (DMA) is supported, sending or receiving data directly from RAM.

The reason behind unimpressive disk speeds lies in the steps that need to be performed in order to retrieve data from disk. There are three main sources of disk latency:

Seek time: The time it takes to move to the right track
1. Time needed to accelerate read/write head towards the track
2. Time needed to decelerate read/write head once it reaches the track
3. Longer for write head, since it needs to settle in place to avoid damaging other tracks
Rotational latency: The time it takes to wait for the read head to pass over data
Transfer latency: The time it takes to transfer data from disk to RAM
1. Time needed to wait for sector to pass by head
2. Time needed to copy data from controller to RAM

All of these steps tend to be painfully slow compared to CPU speed, since the components of a hard drive tend to be much larger than those of a comparable CPU. Unlike newer technologies like flash memory, hard drives contain moving parts; as a result, there are physical limitations to how fast one can change the momentum of the disk and the read/write head, and these limitations translate into unavoidable latencies.

Real-World Specs: Seagate Barrucda ES.2 1TB Hard Drive

This drive is typically used for nearline storage, a "bread-and-butter drive" that is used to store secondary data in a data center. This drive is thus better than consumer-grade hard drives, but not fast enough to handle the most frequently accessed stuff in an enterprise cluster. The Barracuda has the following specs:

Serially Attached SCSI (SAS): This is a disk controller industry standard, which recently emerged to replace the old parallel SCSI bus standard.
7200 rpm (120 Hz): The rotational frequency of the hard drive, or how many times it performs a full rotation per minute (second).
8.333... ms: The rotation time of the hard drive. This is equal to the inverse of rotational frequency.
8.5/9.5 ms: The average read/write seek time. The tolerance for the write head is lower than the tolerance of the read head, so we need to wait longer in the case of a write.
0.8/1.0 ms: The track-to-track seek time, or how long does it take to move the read/write head between adjacent tracks.
20 ms (guess): The time it takes to perform a full-stroke seek (from the outer rim of a platter to the center of the disk)
1.29 Gb/s: Maximum internal transfer rate, or the speed data is copied from disk platter to buffer. (Notice that Gb means gigabits, NOT gigabytes!)
0.93 Gb/s (~116 MB/s): Maximum sustained transfer rate, or the speed of doing a full-stroke read/write of the data, including switching heads, seek, and processing.
3 Gb/s: The interface speed; that is, how fast the disk controller can talk to RAM.

Next time...

How much time will it take to read one sector of data from this disk?