Circular wait means that a chain of processes is waiting in a circle. For example, process A is waiting for process B; process B, in turn, is waiting for process C; and process C is waiting for process A. Since all three processes are waiting in a circle, they will wait indefinitely for the others to finish.
Mutual exclusion means that no two processes can hold a single resource at the same time. Essentially, when a process is holding a resource, it is excluding other processes from acquiring it.
No preemption of locks means that no process is able to "shove" other processes aside and acquire resources that are being held. When a lock is locked, no other process can break in and take the resource being locked.
Hold and wait means that processes need to be able to hold a resource while waiting for another.
There's a catch to all of this, however. All four of these conditions must hold for a deadlock to occur, and since deadlock is not usually a desirable feature in a system, we would like to make one of these conditions false and thereby get rid of the possibility of deadlock. The trouble is that each condition is also a useful feature of the system in its own right. If we give up mutual exclusion so that deadlock cannot occur, for example, then we lose the guarantee that a resource is only held by a single process at a time, which could lead to race conditions and unreliable outputs.
So which one of these features/conditions should we give up?
Usually, the answer is that we want to give up circular wait.
There are several ways that we can design our system to avoid deadlock.
The first way is looking for cycles every time a wait command is executed. We do this by creating a wait-for graph.
Processes in this graph are displayed as circles and resources as squares. A solid line indicates that a process has acquired the resource in question, while a dotted line indicates that a process is trying to acquire the resource in question.
This is an example of a wait command about to be executed that would cause circular wait: the first process holds one resource and is waiting for a second resource, which is held by the second process; the second process is now about to try to acquire the resource that the first process holds.
As we can see, if the second process tries to acquire the resource that the first process holds, it will cause circular wait: the first process is already waiting for the resource that the second process holds, and the second process, while waiting for the first process's resource (which will never be given up), continues to hold its own resource, ensuring that the first process's wait never ends. Deadlock will occur.
However, this is easily prevented now that the wait-for graph has been established. By checking through the graph, the system is able to realize that there will be a cycle created in the graph if the try-acquire is allowed. It will not allow the execution of the wait command, and circular wait will have been prohibited. Deadlock will be avoided.
Performance-wise, this is a somewhat slow algorithm. The cost for checking for cycles in this graph depends on the number of processes, designated as P, and the number of resources, designated as R. With a naive checking algorithm, this is likely to take somewhere on the order of O(PR) or worse. There is a way to do this in O(1), or constant time, but it requires putting further constraints on the system.
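As a rough illustration of what that naive check looks like, here is a minimal C sketch, assuming the wait-for graph is stored as an adjacency matrix and that processes and resources are both represented as node indices (all names here are made up for illustration):

    #include <stdbool.h>

    #define MAX_NODES 128

    bool waits_for[MAX_NODES][MAX_NODES];  /* waits_for[a][b]: node a waits for node b */
    int  nnodes;                           /* number of nodes currently in the graph   */

    /* Depth-first search: can we reach 'target' by following wait-for edges from 'node'? */
    static bool reaches(int node, int target, bool visited[])
    {
        if (node == target)
            return true;
        visited[node] = true;
        for (int next = 0; next < nnodes; next++)
            if (waits_for[node][next] && !visited[next] && reaches(next, target, visited))
                return true;
        return false;
    }

    /* Before letting 'proc' wait for 'res', refuse if the new edge would close a
       cycle - that is, if 'res' already (transitively) waits for 'proc'. */
    bool wait_would_deadlock(int proc, int res)
    {
        bool visited[MAX_NODES] = { false };
        return reaches(res, proc, visited);
    }

The system would only execute the wait if wait_would_deadlock returns false; walking the whole graph on every wait is what makes this approach slow.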
Lock ordering is our second method of avoiding deadlock. It consists of two restrictions:
1. Locks are given a fixed order (say, by number), and every process acquires the locks it needs in increasing order.
2. If a process finds that a lock it needs is unavailable, it releases every lock it currently holds and starts over.
As we can see, this solves deadlock by violating deadlock condition 4 (hold and wait): a process can no longer wait for a resource while it holds a resource. The only time a process is allowed to wait is when it holds no resources and is trying to acquire its first one, which is fine, because in that case it cannot be holding up another process.
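Here is a minimal C sketch of how lock ordering might look, assuming locks are identified by number and that a non-blocking try_acquire primitive exists (both are assumptions for illustration, not a real API):

    #include <stdbool.h>

    bool try_acquire(int lock_id);   /* assumed non-blocking primitive */
    void release(int lock_id);

    /* Acquire the 'n' locks in 'locks', which the caller has sorted in
       increasing order. If any lock is unavailable, release everything
       already held and start over, so we never wait while holding a lock. */
    void acquire_all(const int locks[], int n)
    {
        for (;;) {
            int got = 0;
            while (got < n && try_acquire(locks[got]))
                got++;
            if (got == n)
                return;              /* success: all locks held */
            while (got-- > 0)        /* failure: give everything back and retry */
                release(locks[got]);
        }
    }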
Method 2 seems much better than Method 1, because it can be done in O(1) - constant time. Yet all is not as it seems, and there are advantages to the slower, naive approach employed in creating a wait-for graph and looking for cycles.
Lock ordering depends on the system it is employed in being cooperative!
This is to say that all processes, and therefore all applications created on the system, need to be well-behaved. This may very well lead to confusion when an application developer attempts to create an application and receives errors about deadlock, as they may not understand the approach to locking that is established in the system.
Quasi-deadlock is when processes do not actually deadlock, but may as well be deadlocking due to lack of useful work being performed - or the wrong work being performed!
Priority inversion is one type of quasi-deadlock. A case study given in lecture that illustrates this kind of quasi-deadlock is the Mars Pathfinder mission in 1997.
What, exactly, is priority inversion?
Imagine three different levels of priority are established by a system: T_HIGH, T_MED and T_LOW, corresponding to a high priority, a medium priority, and a low priority, respectively. Now, a process that operates on priority T_LOW acquires a resource. Immediately afterwards, a process with T_HIGH is entered into the system, and a context switch occurs so that the higher priority process can run. It tries to acquire the same resource that the T_LOW process has already acquired. Since it cannot acquire that resource, it puts itself in a waiting state.
Normally this would be fine, because no cycles are created. Once the T_LOW process finishes, the T_HIGH process would be granted the resource.
However, the catch is the existence of T_MED processes in this system. Suppose that a process that is given T_MED priority enters the system sometime after the T_HIGH process begins running. As long as there are T_MED processes runnable, the T_LOW process will not be able to run, meaning that the T_HIGH process may wait indefinitely for the resource that is held by the T_LOW process.
It's easy to avoid priority inversion by simply designing the system with no more than two levels of priority. However, additional levels of priority are often desirable, so we need a more general solution:
When a high priority process waits for a resource, it passes its priority to the process that it is waiting on!
That way, medium-priority processes can no longer preempt the (formerly low-priority) process that the high-priority process is waiting on, and the system avoids priority inversion.
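This technique is commonly called priority inheritance (or priority donation). A minimal sketch of the idea, with made-up structures rather than a real kernel API:

    /* Illustrative structures only. When a waiter's priority is higher than the
       holder's, the holder temporarily runs at the waiter's priority until it
       releases the lock. */
    struct thread {
        int priority;        /* current (possibly inherited) priority */
        int base_priority;   /* priority the thread started with      */
    };

    struct mutex {
        struct thread *holder;   /* null when the mutex is free */
    };

    void donate_priority(struct thread *waiter, struct mutex *m)
    {
        if (m->holder && m->holder->priority < waiter->priority)
            m->holder->priority = waiter->priority;   /* T_LOW now runs as if T_HIGH */
    }

    void on_release(struct thread *holder)
    {
        holder->priority = holder->base_priority;     /* drop back once the lock is freed */
    }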
Livelock is another type of quasi-deadlock. Processes are running and doing things, but in the big picture, they're actually stuck and doing no useful work.
For example, here's a graph of system load vs. work done in response to the requests:
Why does the work done all too commonly decrease significantly as load increases, instead of continuing to increase or flattening out?
Consider a buffer. Requests come in at a high rate, say 100 Gb/s, and the CPU processes them at a slower rate, say 10 Gb/s. We might consider the CPU a bottleneck, but the real problem lies in the fact that the CPU needs to do two operations to process the requests. The first is to run an interrupt service routine, which stops whatever the CPU is currently doing and dumps the incoming requests into its buffer; this runs at high priority. The second is the work routine that finishes processing the requests - this is where the work actually gets done - and it runs at low priority.
Since the interrupt service routine runs at a higher priority, when too many requests arrive, the CPU spends all of its time interrupting itself to dump requests into the buffer instead of doing useful work! Even worse, once the buffer fills up because of this, the CPU keeps spending all of its time interrupting itself, now just dropping the requests instead.
The solution to this livelock is to disable the interrupt once the buffer reaches a certain percentage of fullness. In this case, when the buffer is somewhat full, say 75%, the CPU should disable its interrupt service routine and focus entirely on processing requests, re-enabling the interrupt once the buffer has drained.
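A minimal sketch of that fix, with hypothetical buffer and interrupt-control hooks (the 75% high-water mark and all names are illustrative):

    #define BUF_CAPACITY 1024
    #define HIGH_WATER   (BUF_CAPACITY * 3 / 4)   /* ~75% full */

    static int buffered;                /* requests currently queued (updates would
                                           need to be atomic in real code)          */
    void disable_rx_interrupt(void);    /* assumed hardware hooks */
    void enable_rx_interrupt(void);
    void enqueue_request(void);
    void process_one_request(void);

    void rx_interrupt_handler(void)     /* runs at high priority */
    {
        enqueue_request();
        if (++buffered >= HIGH_WATER)
            disable_rx_interrupt();     /* stop interrupting; let the real work run */
    }

    void worker(void)                   /* runs at low priority */
    {
        for (;;) {
            if (buffered > 0) {
                process_one_request();
                if (--buffered < HIGH_WATER)
                    enable_rx_interrupt();
            }
        }
    }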
Synchronization problems and deadlocks are inconvenient and annoying, and they arise from one main cause - that the system that has them is running two or more threads at one time.
The simple solution is to just use one thread! More specifically, we still want to wait for events, but we want to avoid having any data races in our system. The formal name for such a solution is event-driven programming, and it is popular in small systems.
Here's some pseudocode that approximates how an event-driven program might work:

    main program:
        while (true) {
            wait for some event e;
            e->handler(e);
        }
Handlers can't wait in any way, shape or form! The only waiting that is done in an event-driven program is done via the while loop.
However, this means that there is a clear disadvantage to event-driven programming. Since handlers cannot wait, they must complete their operations quickly, or risk failing to handle future events. This will lead to a slow app and dissatisfied users.
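To make the pseudocode above concrete, here is a small C sketch of such a loop; the event structure and wait_for_event function are made up for illustration:

    /* Each event carries a pointer to its handler; handlers must finish
       quickly and are never allowed to block. */
    struct event {
        void (*handler)(struct event *e);
        void *data;
    };

    struct event *wait_for_event(void);    /* assumed: blocks here, and only here */

    int main(void)
    {
        for (;;) {
            struct event *e = wait_for_event();
            e->handler(e);                 /* runs to completion without waiting */
        }
    }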
In summary, here are the pros and cons of event-driven programming:

Pros
+ no locks
+ no deadlock
+ no bugs due to forgetting to lock

Cons
- handlers are restricted (and this may be awkward - complex operations must be broken up into smaller handlers)
- performance (remember that this is popular with small systems; it does not scale well, and cannot scale with multithreading by definition)
Event-driven programs can scale, but they typically do it via multiprocessing or even using multiple systems.
Typically, the CPU spends time acquiring locks. How much time is spent doing this when an average application is run? Hint: the answer is too much, especially when we also consider the time that is spent waiting for locks.
So we turn to hardware support, in the form of Hardware Lock Elision (HLE). This is a way to get the effect of having a lock without actually acquiring one!
It works like this (x86 assembly example, using the Haswell HLE prefixes):

    lock:
        movl $1, %eax                     // 1 means "locked"
    try:
        xacquire lock xchgl %eax, lock    // atomically swap %eax with the lock variable
        cmp $0, %eax                      // look at the old value - if it's 0, we got the lock!
        jnz try                           // otherwise, someone else holds it, so we try again
        ret

    unlock:
        xrelease movl $0, lock
        ret
What is important to note is that this code also runs without the prefixes - 'xacquire' and 'xrelease' are simply ignored on CPUs that don't support them. The difference is that the prefixes make the code run optimistically: the code assumes it is able to acquire the lock, and if it later finds that it was wrong, it uses the cache to roll back, so that all the instructions it executed effectively never happened.
The downside of Hardware Lock Elision, however, is that it can be confusing to follow: the rollback may make the program jump back to a point in the assembly code that would be impossible judging from the code alone. For example, it could jump from the unlock region of the code back into the lock region.
At the hardware level, this is generalized to transactional memory.
Now that we are finally finished with the deadlock portion of this lecture, we can move on to file systems.
Let's examine an example file system. It comprises 120 petabytes (PB) of storage, which is equivalent to 120 million GB. This is not all on one drive: it is made out of 200,000 hard drives, each holding 600 GB. The main reason for using many smaller drives is that they are fast and cheap - it would be very difficult, if not impossible, to find a 120 PB drive, and even if one could be found, it would likely cost an exorbitant amount.
Our file system is running GPFS (General Parallel File System), a file system intended for very large installations; it would not likely be found on a home computer or a laptop.
We want to make this file system as fast as possible, so we need to look into ways to increase its performance.
Imagine that we spread a file's data across all the disk drives, disk 1 to disk N - this is striping, and it speeds up sequential accesses. However, if the file being accessed is very long, we would need huge buffers, which is inconvenient. This is solved by accessing the disks in parallel (remember that this is running the General Parallel File System), thus gaining the advantages of striping without the disadvantage mentioned above.
Striping does have another disadvantage: because each file's data is spread over several disks, a single disk failure can corrupt data in many files. Backups and redundancy help to prevent this from causing data loss.
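As a rough sketch of what striping means, here is how a striped layout might map a file's logical block number onto a physical disk and an offset within that disk (the constants and names are illustrative):

    #define N_DISKS 200000    /* number of drives in this example system */

    struct location {
        int  disk;            /* which drive holds the block     */
        long block_on_disk;   /* block offset within that drive  */
    };

    /* Consecutive logical blocks land on consecutive disks, so up to
       N_DISKS reads can proceed in parallel. */
    struct location stripe_lookup(long logical_block)
    {
        struct location loc;
        loc.disk = logical_block % N_DISKS;
        loc.block_on_disk = logical_block / N_DISKS;
        return loc;
    }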
What is metadata, exactly? Metadata is data about the data that is stored on disk. For example, metadata consists of things such as file location, owner, and timestamp.
Distributed metadata is exactly what it sounds like. Imagine a cloud of nodes. We might think that one is a "control" or "master" node that contains metadata, while the others contain data, but it is in fact more efficient to have all the nodes retain metadata. There is no singular "control" node, and this makes metadata easier and faster to access.
Suppose we have a directory with a million files. This is not a directory that has directories in it, which in turn have a couple dozen or hundred or thousand files that add up to a million in number. This is a directory with a million files directly in it. With the wrong storage method, this could become difficult to access and use. We want the directory to work regardless of how many files are stored in it.
The solution is to store the directory entries in an internal data structure such as a balanced binary tree, which improves lookups from O(n) to O(log n).
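A minimal sketch of an O(log n) directory lookup - here over a sorted array of entries rather than an on-disk tree; the real GPFS directory format is more elaborate, so treat this purely as an illustration of the idea:

    #include <string.h>

    struct dir_entry {
        const char *name;
        long inode;
    };

    /* Binary search over entries sorted by name; returns the inode, or -1 if absent. */
    long dir_lookup(const struct dir_entry *entries, long n, const char *name)
    {
        long lo = 0, hi = n - 1;
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            int cmp = strcmp(name, entries[mid].name);
            if (cmp == 0)
                return entries[mid].inode;
            if (cmp < 0)
                hi = mid - 1;
            else
                lo = mid + 1;
        }
        return -1;   /* not found */
    }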
Sometimes the system is partitioned, or split, by a network failure. When a system has network partition awareness, it is aware that this has happened. The largest partition stays live, retaining the ability to do both reads and writes, while the smaller partitions are "frozen" and can be read from but not written to. Typically, the smaller partition is just a small number of nodes that got disconnected - it is rarely the case that the split is anywhere close to half-and-half.
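One common way to decide which partition is "largest" is a simple majority of the total node count; here is a tiny sketch of that check (illustrative only, not necessarily how GPFS implements it):

    #include <stdbool.h>

    /* Stay live (reads and writes) only with a strict majority of the nodes;
       a minority partition freezes itself to read-only until the network heals. */
    bool partition_allows_writes(int nodes_reachable, int total_nodes)
    {
        return 2 * nodes_reachable > total_nodes;
    }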
This is fairly self explanatory. When maintenance is being done, the file system is not shut down.
Notice that the majority of these ideas are dealing with performance, as that was the goal. Performance is a major issue with file systems!
But how exactly should we measure performance? Much like when we measured the performance of schedulers in a past lecture, we can define metrics for the performance of a file system.
We want to minimize latency and maximize throughput and utilization.
Let's assume a system with the following characteristics (all times in microseconds; these are the figures used in the calculations below):
- sending a command to the SSD: 5
- SSD latency: 50
- issuing a PIO (programmed I/O) to transfer one byte: 1 (so 40 bytes take 40)
- computing on a 40-byte buffer: 5
- blocking (context switching away and back): 50
- handling an interrupt: 5
- checking whether the device is ready: 1
So, let's suppose that we are attempting to read bytes from disk into a buffer, and then subsequently performing some operation on the bytes (in this case, we'll say encryption). Here's our first try:
    for ( ; ; ) {
        char buf[40];
        read 40 bytes from SSD into buf
        compute(buf);
    }
Now, we can examine the metrics of this method. All units will be in microseconds.
Our latency: 5 to send the command to the SSD + 50 for SSD latency + 40 to issue the PIOs + 5 for computation = 100 microseconds in total (for each iteration of the loop).
Throughput = 1/latency = 10,000 requests per second.
Utilization = 5 microseconds of computation per 100 microseconds = 5%.
We can note that the utilization in particular appears to be poor. 5% might be good enough for an embedded system, but it is a bit of a waste in others.
In an attempt to speed it up, we can use several methods: batching, interrupts, DMA, and DMA combined with polling.
Here is the pseudocode, modified to implement batching:
    for ( ; ; ) {
        char buf[840];
        read 840 bytes from SSD into buf
        for (i = 0; i < 21; i++)
            compute(buf + i*40);
    }
In batching, we grab a large amount of data from the disk before computing it all at once. Let's compute metrics again.
Latency (in microseconds): 5 to send the command to the SSD + 50 for SSD latency + 840 to issue the PIOs + 105 for computation = 1,000 microseconds in total.
Throughput = 21 requests per 1,000 microseconds = 21,000 requests per second.
Utilization = 105 microseconds spent computing / 1,000 total = 10.5%.
Judging from these metrics, we roughly double our utilization and throughput, but our latency increases by a factor of ten.
The next method uses interrupts. Again, some pseudocode:
    for ( ; ; ) {
        write command to SSD
        do {
            block until there's an interrupt   // process goes to sleep
            handle the interrupt
        } while (!ready)
        read 40 bytes from SSD into buf        // via PIOs
        compute(buf)
    }
This uses a blocking mechanism and an interrupt. Calculating the metrics again:
Latency (in microseconds): 5 to send the command to the SSD + 50 for blocking + 5 to handle the interrupt + 1 to check ready + 40 to issue the PIOs + 5 for computation = 106 microseconds in total (56 microseconds when we exclude the blocking, during which the CPU can do other work).
Throughput = 1 request / 56 microseconds ≈ 17,857 requests per second.
Utilization = 5/56 ≈ 8.9%.
Not as impressive as batching in throughput and utilization, but latency stays low. Next, consider DMA (direct memory access), where the device transfers the data into memory itself and interrupts the CPU when it is done:
    for ( ; ; ) {
        write command to disk
        block until interrupt
        check that the disk is ready
        compute
    }
Latency (in microseconds): 50 for blocking + 5 to handle the interrupt + 1 to check ready + 5 for computation = 61 microseconds in total (11 when blocking is excluded).
Throughput = 1 request / 11 microseconds ≈ 91,000 requests per second.
Utilization = 5/11 ≈ 45%.
This is a pretty big improvement, and is a large reason why DMA is used so often.
Let's look at both DMA and polling used together.
    for ( ; ; ) {
        write command to disk
        while (DMA slots are not ready)   // a DMA slot is room to make another request to the disk
            schedule();                   // yield the CPU; poll again when rescheduled
        check that the disk is ready
        compute
    }
It is essentially the same code, except we poll instead of blocking. Now for metrics:
Latency (in microseconds): 50 for rescheduling + 1 to check the disk + 5 for computation = 56 microseconds in total (6 when rescheduling is excluded; note that we optimistically assume checking the disk takes only 1 microsecond).
Throughput = 1 request / 6 microseconds ≈ 166,667 requests per second.
Utilization = 5/6 ≈ 84%.
When we compare all the options, it is clear that DMA with polling gives the lowest latency and the highest throughput and utilization:
                              | Normal | Batching | Interrupts | DMA    | DMA + Polling
    Latency (microseconds)    | 100    | 1,000    | 106        | 61     | 56
    Throughput (requests/sec) | 10,000 | 21,000   | 17,857     | 91,000 | 166,667
    Utilization               | 5%     | 10.5%    | 8.9%       | 45%    | 84%
Something important to note: the DMA-polling method essentially waits for events to occur and then handles them. We have now established a system that uses event-driven programming!