Scribes: Ken Ohhashi, Colin Fong, Stanway Liau
May 5, 2014
Circular waiting: A process is waiting for a resource held by another process, which is in turn waiting for a resource held by the first process.
Priority inversion: a higher-priority process ends up waiting on a lower-priority process.
EXAMPLE - There are 3 levels of priority for threads.
*Time runs from left to right and downward.

     | Tlow (runnable)  | Tmed (waiting)    | Thigh (waiting)
  1. | acquire(&r) ---- | -context switch-> | (runnable)
  2. |                  |                   | acquire(&r)
  3. |                  | (runnable)        | (waiting)

Tlow holds r when Thigh tries to acquire it, so Thigh blocks. The scheduler then runs Tmed, the highest-priority runnable thread, and Thigh stays stuck behind the medium-priority thread indefinitely.
                   BUFFER
          INPUT                          OUTPUT
requests ---->[=======================]----> CPU
   100 Gb/s                            10 Gb/s
The basic problem we have is that there is more than one thread running at one time. What if we use just one thread? This will solve our problem of data races, but we still want to be able to wait for events. We can use event-driven programming to do this. In this type of programming, there is usually a main loop that listens for events. Below is a simple illustration of the concept:
main program {
    while (live) {
        wait for some event e;   // the only code that waits for an event
        e->handler(e);           // handlers never wait and must complete quickly
    }
}
Upsides:
No locks!
No deadlock!
No bugs due to forgetting to lock!
Downsides:
Handlers are restricted: they may need to be broken up
Cannot scale via multithreading
Although multithreading is out of the question, we can still scale via multiple processes or multiple systems instead. Some supporters of event-driven programming claim that the scaling provided by multiple processes makes the loss of multithreaded scaling negligible, thus making that downside irrelevant.
In multithreaded programming, too much CPU time is spent waiting for locks to be released. To improve the wait time, we can get hardware support through Hardware Lock Elision (HLE).
Below are example Haswell instructions for HLE. Haswell is a codename for a processor microarchitecture, although this information is not critical to understanding the concept.
lock:
    movl $1, %eax                     # value to swap into the lock word
try:
    xacquire lock xchgl %eax, lock    # swap; xacquire lets us proceed optimistically
    cmp $0, %eax                      # old value 0 means we got the lock
    jnz try                           # nonzero: someone else holds it, retry
    ret
unlock:
    xrelease movl $0, lock            # commit the elided region and release
    ret
HLE adds two instruction prefixes, xacquire and xrelease, to the architecture. xacquire allows the thread to move ahead "optimistically", executing subsequent instructions as if it had acquired the lock. The thread modifies its own cache but doesn't flush the changes out to memory. When execution reaches the xrelease-prefixed store, the hardware checks the value that was traded into the lock: if the thread didn't actually get the lock, it discards all the buffered changes in its cache and acts as if none of those instructions happened. If it did get the lock, the cached changes are flushed to RAM, since the instructions are valid.
Say we have a General Parallel File System (GPFS) with 120PB (~200,000 hard drives, each 600GB). Here are some performance ideas for such a file system:
Usually, processing long files requires enormous buffers. By distributing segments of the file across multiple devices and accessing the segments concurrently, we can increase total throughput and process the file more quickly.
Metadata contains information about a file such as its location, owner, timestamps, etc. Each node in the file system can act as a data node that holds file data or as a meta node that holds metadata. There are no master/slave nodes; it is just a cloud of data and meta nodes.
Suppose a directory in the file system has 10,000,000 files. If the directory is implemented with a well-chosen internal data structure, lookups can remain fast. One possible data structure is the B+ tree, an n-ary tree with an often large number of children per node. Because the large fan-out limits the depth of the tree, stored data/files can be retrieved efficiently with minimal I/O operations.
This is when a file system is partitioned: the larger partition stays live (files can be read, written, added, and removed), and the smaller partition is frozen, or read-only.
Sometimes one can't connect to a server, like the SEASNet Linux server or MyUCLA, because maintenance is being done on it. It would be more convenient for users if the server could stay live during maintenance, but this is hard to implement, since users' actions could interfere with some part of the maintenance in a way that is dangerous for the server or file system.
These are all features an ideal system would provide, but alas, they are difficult to implement.
We will analyze these three performance metrics for file systems:
Assume a system with the following characteristics:
/* Busy wait + polling code */
for (;;) {
    char buf[40];
    read-40-bytes-from-SSD-into-buf;
    compute(buf);
}
/* Batching code */
for (;;) {
    char buf[840];    // larger batch: 21 records of 40 bytes
    read 840 bytes;
    for (i = 0; i < 21; i++)
        compute(buf + i*40);
}
/* Avoid busy waiting code */
for (;;) {
    do {
        block-until-interrupt;
        handle-interrupt;
    } while (!ready);
}
/* DMA code */
while (1) {
    write-command-to-disk;
    block-until-interrupt;
    check-disk-is-ready;
    compute;
}
/* Polling + DMA code */
while (DMA-slots-not-ready)
    schedule();