CS111 Lecture 10

Deadlock and File System Performance

Lecture by: Professor Paul Eggert
Notes by: Justin Hou, Kevin Tong, Aland Kuang, Alexander Ding

Deadlock

Deadlock is a precarious and potentially system-crashing phenomenon that occurs when two or more processes are simultaneously waiting for one another to finish. Because none of them will proceed until another completes (usually because that other process is holding a resource the current process wishes to acquire), the processes halt and end up waiting forever. Deadlock, however, can be avoided entirely if any one of the following four conditions is prevented:

Four Conditions of Deadlock

  1. Circular Wait - When a chain of processes are each waiting for something from the next one in the chain.
  2. Mutual Exclusion - When a resource can be used by only one process at a time.
  3. No Preemption of Locks - When processes are disallowed from breaking other processes' locks (i.e., no shoving others out of their lock).
  4. Hold and Wait - When a process holds/locks a (mutually exclusive) resource and waits for another process to finish before proceeding.

Attacking the Conditions of Deadlock

Conditions 2 and 3 are not easy to eliminate, so programmers often resort to eliminating circular wait instead. It is hard to dismantle intuitive principles like mutual exclusion or locking - after all, what is the point of locking if others may use preemption to unlock it?

Solution 1: Eliminating Cycles (Circular Wait)

Circular wait can be circumvented if wait cycles are prohibited by the resource manager. In this approach, the resource manager carefully checks for cycles whenever a process wishes to acquire (and therefore potentially "wait for") a mutually exclusive resource. If the current process would create a cycle by waiting for a certain resource, the resource manager will simply refuse the request, thereby eliminating any chance of deadlock.
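
For intuition, here is a minimal sketch of such a check in C, under the simplifying assumption that each process waits on at most one other process; the waiting_for table and would_deadlock helper are illustrative names, not any real resource manager's API.

#include <stdbool.h>

#define NPROC 64
static int waiting_for[NPROC];    /* waiting_for[p] = process p is waiting on, or -1 */

/* Before letting process p wait on process q (the holder of the resource p
   wants), walk the chain of waiters starting at q; if it leads back to p,
   granting the wait would close a cycle, so the manager refuses. */
static bool would_deadlock(int p, int q)
{
    for (int cur = q; cur != -1; cur = waiting_for[cur])
        if (cur == p)
            return true;    /* circular wait detected */
    return false;
}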

There are drawbacks to this approach, however. Suppose there are P processes and R resources. If the resource manager must look for circular wait on every request, we can estimate the time-complexity of each such check to be O(P*R).

This is undesirable; we want it to be of time-complexity O(1). Is this possible?

Solution 2: Lock Ordering (Hold and Wait)

We can bring the time-complexity of avoiding circular wait down to O(1) if we use lock ordering. This is made possible if all processes cooperatively comply with the following rules:

  1. Processes lock resources in increasing order (with respect to each resource's machine address) as they seek to acquire them.
  2. If a process encounters a resource that is busy (another process has locked it already), it forfeits all of its existing locks and starts over.

This way, processes never hold and wait, meaning they never lock a mutually exclusive resource while waiting for another mutually exclusive resource. By forfeiting all locks when a process encounters a busy resource, deadlock is avoided entirely.

The problem with this approach, despite its tremendous efficiency, is that it is inconvenient for software developers to comply with. Programmers are not accustomed to writing code that actively concedes to other processes, especially when it is not clear which resources may be in use by other processes. Simply put, it is unreasonable to implement a lock ordering solution unless all parties involved are willing to comply and write cooperative code.
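
For parties that do comply, a minimal sketch of the discipline in C with POSIX mutexes might look like the following; lock_pair is a hypothetical helper, and pointer addresses stand in for the resources' machine addresses.

#include <pthread.h>

/* Acquire two mutexes without ever holding one while blocking on the other. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    /* Rule 1: always lock the lower-addressed resource first. */
    pthread_mutex_t *first  = (a < b) ? a : b;
    pthread_mutex_t *second = (a < b) ? b : a;

    for (;;) {
        pthread_mutex_lock(first);
        /* Rule 2: if the second resource is busy, forfeit everything
           and start over instead of waiting while holding a lock. */
        if (pthread_mutex_trylock(second) == 0)
            return;                     /* both locks held */
        pthread_mutex_unlock(first);    /* give up and retry */
    }
}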

Quasi-Deadlock: Priority Inversion

Sometimes, a deadlock-like problem arises as a result of priority inversion. Priority inversion occurs when a mutually exclusive resource is held indefinitely by a low-priority process while a higher-priority process waits to acquire it. This is easily explained in the following scenario.

Suppose there are three priorities under which processes can be categorized: low, medium, and high, with higher-priority processes always being scheduled before lower-priority processes.

A low-priority process locks a resource and yields to a high-priority process, which executes until it also needs the locked resource. Because the high-priority process must wait until that resource is free, the system schedules the medium-priority process to run until the high-priority process can obtain the lock for the resource. This, however, never happens, because the medium-priority process will never yield to the low-priority process, which must run in order for the desired resource to be freed. This is a stealthy but extremely dangerous form of quasi-deadlock.

Solution 1: Eliminate the Medium-Priority Category

If there were only two tiers of priority, low and high, this quasi-deadlock would never occur. The high-priority process would simply yield to the low-priority process that holds the lock on the desired resource. This effectively solves the problem, but it is not a useful solution because it would limit the range of priorities to only two categories.

Solution 2: Priority Transformation

A simple solution to this problem that does not limit the range of priority categories would be to implement priority transformation, the elevation of a process' priority to the current process' priority if it holds a resource the current process desires.

If applied to the aforementioned example, priority transformation would cause the low-priority process to be elevated to high priority the moment the original high-priority process attempted to acquire the locked resource. This solution is preferred because it effectively prevents quasi-deadlock while avoiding drastic change to the priority system itself. With priority transformation, programmers are permitted to have three or more tiers of priority without the fear of suffering quasi-deadlock.
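
A minimal sketch of priority transformation, using hypothetical thread and mutex structures rather than any real kernel's API (and omitting the atomicity a real implementation would need):

#include <stddef.h>

struct thread { int priority; int base_priority; };
struct mutex  { struct thread *holder; };

static void yield(void) { /* in a real kernel: invoke the scheduler */ }

/* If the holder of the lock has lower priority than we do, lend it our
   priority so the scheduler will run it and it can release the lock. */
static void acquire(struct mutex *m, struct thread *self)
{
    while (m->holder != NULL) {
        if (m->holder->priority < self->priority)
            m->holder->priority = self->priority;
        yield();
    }
    m->holder = self;
}

static void release(struct mutex *m, struct thread *self)
{
    self->priority = self->base_priority;   /* drop any borrowed priority */
    m->holder = NULL;
}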

Livelock

Livelock is similar to deadlock in that no process manages to make useful progress, but the states of the processes involved in a livelock keep changing, while in deadlock they remain the same.

Consider a system where a bounded buffer receives incoming requests for work and the interrupt service routine has higher priority than the work routine. When the bounded buffer is full, as incoming requests are received, the interrupt service routine halts the current process to handle the request. Because the bounded buffer is full, the request cannot be added to the buffer and is rejected. If there are more requests coming in than can be processed by the CPU, the system will be in a state of livelock as all of its time will be spent handling interrupts instead of working.

One simple solution to livelock is to change the priorities of the system so that the work routine is given higher priority than interrupt handling as the processing capacity nears 100%. In this scenario, the work routine has the highest priority, allowing the bounded buffer to clear some capacity before the priorities revert.
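
A minimal sketch of that policy, using a hypothetical bounded buffer and a flag that stands in for masking the device interrupt:

#include <stdbool.h>

#define BUF_SIZE 128
static int  buf[BUF_SIZE];
static int  count;                       /* requests currently queued */
static bool interrupts_enabled = true;   /* stand-in for masking the device IRQ */

static void process(int req) { (void)req; /* do the actual work for one request */ }

/* Interrupt handler: queue the request, but once the buffer is full, stop
   taking interrupts so the work routine gets the CPU. */
void request_interrupt(int req)
{
    if (count == BUF_SIZE) {
        interrupts_enabled = false;
        return;
    }
    buf[count++] = req;
}

/* Work routine: drain the buffer, then re-enable interrupts. */
void do_work(void)
{
    while (count > 0)
        process(buf[--count]);
    interrupts_enabled = true;
}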

Data Races and Event Driven Programming

The basic problem of data races comes from the fact that there are two or more threads running in parallel. An obvious solution is to just use one thread, though we still want to be able to wait for events.

One implementation is as follows:


while (1) {
	/* Wait for some event e */
	e->handler(e);
}
/* A caveat is that since the handlers never wait, they must complete their task quickly. */

With this implementation, there are no locks, which means that there are also no deadlocks or tricky bugs related to deadlocking. However, because handlers are restricted, the programmer may be forced to break up handlers for different resources. This approach also cannot scale through multithreading because of shared memory, though it can scale through multiple processes or multiple nodes.

Hardware Lock Elision (HLE)

One problem with spin locks is that too much CPU time is spent locking and unlocking. In this instance, there is hardware support in the form of Hardware Lock Elision, which provides two new instruction prefixes: xacquire and xrelease. Their implementations are as follows:


lock:	movl $1, %eax
try:	xacquire lock xchgl %eax, mutex	# atomically swap %eax with the lock variable (mutex)
	cmp $0, %eax			# was the lock free (0) before the swap?
	jnz try				# if not, someone else holds it; try again
	ret

unlock:	xrelease movl $0, mutex		# release the lock
	ret

HLE is backwards compatible: on older machines the prefixes are simply ignored and the code runs as an ordinary (slower) spin lock. In the case that something goes wrong, the cache of the current thread will be flushed, effectively undoing all actions of the critical section so other threads will not be affected.
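
The same idea can be sketched in C using GCC's HLE-aware atomic built-ins; this sketch assumes an x86 target compiled with HLE support (-mhle), and the variable name mutex is illustrative.

#include <immintrin.h>	/* _mm_pause */

static volatile int mutex = 0;

static void lock(void)
{
    /* xacquire-prefixed exchange: the hardware may elide the lock and run
       the critical section transactionally. */
    while (__atomic_exchange_n(&mutex, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();	/* spin politely until the lock looks free */
}

static void unlock(void)
{
    /* xrelease-prefixed store ends (commits) the elided critical section. */
    __atomic_store_n(&mutex, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}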

File System Performance

In order to construct a stable, efficient operating system, it is important to implement an efficient file system, which is a framework that regulates the creation, management, and deletion of data in the form of files.

Example File System

Suppose we have a very large file system. The file system chooses to spread its data out across many hard drives because this allows us to preserve data safely without needing to worry about the failure of any single disk. How can this be done?

Performance Ideas

To construct a robust and efficient file system, it is beneficial to understand a variety of performance ideas:

Idea 1: Data Striping

Supplementary Reading: http://en.wikipedia.org/wiki/Data_striping

Data Striping is the simple process of segmenting data so that it can be placed consecutively over any number of sequentially ordered physical data storage devices.

The smaller hard drives, ordered sequentially, seemingly emulate one large hard drive. This means that while small data may comfortably fit within one hard drive, large data files may be segmented and stretched over multiple hard drives.

While this system is intuitive, the failure of a single hard drive can be critical to the integrity of the file system, since pieces of many files may be lost at once.
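
A minimal sketch of the mapping from a logical byte offset to a (disk, offset-within-disk) pair under simple striping; the disk count and stripe size below are made-up parameters.

#define NUM_DISKS   4       /* illustrative */
#define STRIPE_SIZE 4096    /* bytes per stripe unit, illustrative */

/* Map a logical byte offset onto (disk number, offset within that disk). */
static void stripe_map(long logical, int *disk, long *offset)
{
    long unit = logical / STRIPE_SIZE;              /* which stripe unit */
    *disk   = unit % NUM_DISKS;                     /* round-robin across the disks */
    *offset = (unit / NUM_DISKS) * STRIPE_SIZE      /* start of that unit on its disk */
              + logical % STRIPE_SIZE;              /* position within the unit */
}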

Idea 2: Distributed Metadata

The Distributed Metadata system can essentially be understood as a cloud of nodes. How each node is chosen to be implemented is independent of the overall function of the distributed metadata system. There are two types of nodes: metadata nodes, which keep track of where data lives, and data nodes, which store the data itself.

The central idea behind this file system is the duplication of valued data over a multitude of data nodes. If a certain file is accessed frequently (relative to other files in the file system), it is actively duplicated and stored in more and more data node clusters. This way, throughout the entire cloud of nodes/node clusters, duplicates of neglected files are gradually overwritten by duplicates of more popular files.
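
A minimal sketch of such a popularity-driven replication policy; the thresholds, counters, and rebalance helper are purely illustrative.

#define MAX_REPLICAS 8

struct file_entry {
    long access_count;   /* how often this file was read in the last window */
    int  replicas;       /* how many data nodes currently hold a copy */
};

/* Called periodically by the metadata nodes: popular files gain replicas,
   neglected files give theirs up so the space can be reused. */
void rebalance(struct file_entry *f)
{
    if (f->access_count > 1000 && f->replicas < MAX_REPLICAS)
        f->replicas++;       /* copy the file onto one more data node */
    else if (f->access_count < 10 && f->replicas > 1)
        f->replicas--;       /* let a copy be overwritten by hotter data */
    f->access_count = 0;     /* start a new measurement window */
}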

Idea 3: Network Partition Awareness / File System Stays "Live"

It is important that the file system is functioning and running the majority of the time; this is particularly true when the file system undergoes maintenance. Whether the service involves the replacement of faulty storage devices or the integration of new storage devices, system administrators strive to disable as few nodes as possible.

If only a small fraction of the total nodes are "frozen" at a time, it is entirely possible to modify or update the entire file system without disabling user access. Clients would merely experience slower traffic as they access the smaller network of active nodes. By repeating this process, the file system can eventually be fully updated without experiencing any real downtime.
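
A minimal sketch of such a rolling update, with hypothetical helpers for taking a node offline, updating it, and bringing it back; the node count and batch size are made up.

#define NUM_NODES 100
#define BATCH     5     /* freeze at most 5% of the nodes at a time, illustrative */

static void node_take_offline(int n) { (void)n; /* drain and freeze node n */ }
static void node_update(int n)       { (void)n; /* replace or upgrade node n */ }
static void node_bring_online(int n) { (void)n; /* node n rejoins the cloud */ }

/* Update the whole system a small batch of nodes at a time, so clients always
   have most of the cloud available (just with slightly slower traffic). */
void rolling_update(void)
{
    for (int start = 0; start < NUM_NODES; start += BATCH) {
        for (int n = start; n < start + BATCH && n < NUM_NODES; n++)
            node_take_offline(n);
        for (int n = start; n < start + BATCH && n < NUM_NODES; n++) {
            node_update(n);
            node_bring_online(n);    /* rejoin before the next batch is frozen */
        }
    }
}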

File System Performance Metrics

There are three primary metrics used to measure the quality of a file system:

  1. Latency - Delay between sending a request and receiving a response.
  2. Throughput - Rate of request completions.
  3. Utilization - Fraction of the system's capacity (e.g., of the requests/sec it could handle) that is spent doing useful work.

We can observe utilization under multiple contexts (CPU, bus bandwidth, etc). For the following examples, we are monitoring CPU utilization.
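
As a small worked example of how these three metrics relate, here is a sketch using the task times assumed for Method 1 below.

#include <stdio.h>

int main(void)
{
    /* Task times in microseconds, as assumed for Method 1 below. */
    double send = 5, ssd_latency = 50, read40 = 40, compute = 5;

    double latency     = send + ssd_latency + read40 + compute;    /* 100 us per request */
    double throughput  = 1e6 / latency;        /* requests completed per second */
    double utilization = compute / latency;    /* fraction of time doing useful work */

    printf("latency = %.0f us, throughput = %.0f req/s, utilization = %.0f%%\n",
           latency, throughput, utilization * 100);
    return 0;
}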

Example System Characteristics

Suppose we have a system that performs the following tasks in their respective listed times (μs = microsecond), as used in the calculations below:

  Send command to SSD: 5μs
  SSD latency: 50μs
  Read from SSD: 1μs per byte (40 bytes = 40μs, 840 bytes = 840μs)
  Compute buffer (process 40 bytes): 5μs
  Block until interrupt / polling & scheduling: 50μs
  Handle interrupt: 5μs
  Check if ready: 1μs

For the following implementations, consider their respective latencies:

Method 1: Busy-Wait + Polling

for(;;) {
	char buf[40];
	/* Read 40 bytes from SSD into buf */
	/* Compute Buffer */
}

Latency (μs):
Send Command to SSD (5μs) + SSD Latency (50μs) + Read 40 bytes from SSD (40μs) + Compute Buffer (5μs) = 100μs
Throughput (requests/sec):
1 / Latency (100μs) = 10,000 req/s
Utilization:
Compute Buffer (5μs) / Latency (100μs) = 5%

This method is intuitive, but it is not terribly efficient. We are only utilizing 5% of CPU power.

Method 2: Batching

for (;;) {
	char buf[840];
	/* Read 840 bytes from SSD into buf */
	int i;
	for (i = 0; i < 21; i++)
		compute(buf + (i*40)); /* Compute Buffer */
}

Latency (μs):
Send Command to SSD (5μs) + SSD Latency (50μs) + Read 840 bytes from SSD (840μs) + Compute Buffer (105μs) = 1000μs
Throughput (requests/sec):
21 requests / Latency (1000μs) = 21,000 req/s
Utilization:
Compute Buffer (105μs) / Latency (1000μs) = 10.5%

The Batching method allows for roughly double the utilization of the previous method, but it may prove inefficient if users do not intend to read in large amounts of data.

Method 3: Block Until Interrupt

for (;;) {
	/* Block until interrupt */
	do {
		/* Handle interrupt */
	} while (!ready);
	/* Read 40 bytes from SSD into buf and compute buffer */
}

Latency (μs):
Send Command to SSD (5μs) + Block Until Interrupt (50μs) + Handle Interrupt (5μs) + Check Ready (1μs) + Read 40 bytes from SSD (40μs) + Compute Buffer (5μs) = 106μs
Throughput (requests/sec):
While a request is blocked (50μs), the CPU can work on other requests, so each request costs 106μs - 50μs = 56μs of CPU time: 1 / 56μs ≈ 17,857 req/s
Utilization:
Compute Buffer (5μs) / CPU time per request (56μs) ≈ 8.9%

The utilization is a little bit less than before here, but the latency has been drastically reduced. As is evident in all of these methods, having the CPU itself copy data from the SSD takes a lot of time. Direct Memory Access (DMA) allows the OS to work around this.

Method 4: Direct Memory Access (DMA)

Supplementary Reading: http://en.wikipedia.org/wiki/Direct_memory_access


while (1) {
	/* Write Command to Disk */
	/* Block Until Interrupt */
	/* Check if Disk is Ready */
	/* Compute Buffer */
}

Latency (μs):
Block Until Interrupt (50μs) + Handle Interrupt (5μs) + Check Ready (1μs) + Compute Buffer (5μs) = 61μs

Because we are using Direct Memory Access, the CPU is free while the device transfers the data, so the 50μs of Block Until Interrupt does not count as CPU time; the CPU time per request is 61μs - 50μs = 11μs.

Throughput (requests/sec):
1 / CPU time per request (11μs) ≈ 91,000 req/s
Utilization:
Compute Buffer (5μs) / CPU time per request (11μs) ≈ 45%

This result is much more favorable than the results of the previous methods--in fact, methods involving DMA like this one are used by most operating systems. It is clear Direct Memory Access is an ideal way for the OS to operate, as it allows hardware to access memory independently of the CPU. However, performance can be further enhanced if polling is implemented in coordination with DMA.

Method 5: Direct Memory Access (DMA) + Polling

while (1) {
	/* Write Command to Disk */
	/* Polling & Scheduling Implementation */
	/* Check if Disk is Ready */
	/* Compute Buffer */
}

Latency (μs):
Polling & Scheduling (50μs) + Check Ready (1μs) + Compute Buffer (5μs) = 56μs

Because the CPU spends the Polling & Scheduling period (50μs) running other work rather than idling on this request, that time does not count against it; the CPU time per request is 56μs - 50μs = 6μs.

Throughput (requests/sec):
1 / CPU time per request (6μs) ≈ 166,667 req/s
Utilization:
Compute Buffer (5μs) / CPU time per request (6μs) ≈ 84%

Direct Memory Access + Polling nearly doubles both the throughput and utilization of the system. Notice the performance differences between Method 1 and Method 5. With Direct Memory Access, it is possible to construct a system that operates much faster and with much more efficiency than without. However, it is important to remember that DMA is a privilege that software outside of the OS should probably be disallowed from using, as DMA would allow malicious programs to read and write memory independently of the CPU and the protections that the operating system relies on.