It would be nice if we could make our spinlocks go faster; in particular, it's the xchgl primitive that slows things down.
On Haswell processors (and others, but not the SEASnet machines), Hardware Lock Elision (HLE) still does the locking, but it "cheats" to get performance. On x86, the lock acquisition still executes the xchgl instruction, but with a new prefix, xacquire, that modifies the way xchgl behaves:
With Hardware Lock Elision, multiple threads can execute critical sections protected by the same lock at the same time, as long as they aren't performing operations that would conflict; the xacquire prefix is what allows this to happen.
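As a rough illustration, here is a minimal sketch of an HLE spinlock written with GCC's atomic builtins; the __ATOMIC_HLE_ACQUIRE and __ATOMIC_HLE_RELEASE flags are real GCC extensions that emit the xacquire/xrelease prefixes on x86, but the lock layout and function names are my own, not code from lecture. On processors without HLE the prefixes are simply ignored and this behaves as an ordinary spinlock.

typedef volatile int hle_lock_t;

static void hle_acquire(hle_lock_t *l)
{
    /* compiles to an xacquire-prefixed xchg on x86, so the hardware
       may elide the lock and run the critical section speculatively */
    while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        while (*l)
            continue;   /* spin until the lock looks free, then retry */
}

static void hle_release(hle_lock_t *l)
{
    /* compiles to an xrelease-prefixed store, ending the elided region */
    __atomic_store_n(l, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}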
Simple primitives like deposit and withdraw won't deadlock, because each needs to hold only one lock at a time. However, we can get into trouble when we deal with more complicated code. For example, consider the following transaction:
transfer($3, acct #19, acct #27);
An implementation of this transfer function might lock both account 19 and account 27. Now suppose two threads each call transfer() in opposite directions: each locks its own "from" account and then tries to lock the other thread's account, so both spin forever. Hardware locking and condition variables can't prevent this problem; it's a classic deadlock.
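To make the scenario concrete, here is a hedged sketch in C; the account struct, the field names, and the lock-ordering fix are my own illustration, not code from lecture.

#include <pthread.h>

struct account { pthread_mutex_t lock; long balance; };

/* Deadlock-prone version: thread A calls transfer_bad(3, a19, a27)
   while thread B calls transfer_bad(5, a27, a19); each grabs its
   "from" lock and then waits forever for the other's. */
void transfer_bad(long amt, struct account *from, struct account *to)
{
    pthread_mutex_lock(&from->lock);
    pthread_mutex_lock(&to->lock);      /* may wait on the other thread forever */
    from->balance -= amt;
    to->balance += amt;
    pthread_mutex_unlock(&to->lock);
    pthread_mutex_unlock(&from->lock);
}

/* One standard fix: always acquire locks in a global order
   (here, by address), so a cycle of waiting threads can't form. */
void transfer_ok(long amt, struct account *from, struct account *to)
{
    struct account *first  = from < to ? from : to;
    struct account *second = from < to ? to   : from;
    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    from->balance -= amt;
    to->balance += amt;
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}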
Deadlock is a global property of your system: you can't tell, by looking at one part of the code in isolation, that it will be dangerous. Properties like this are deadly in operating systems precisely because it's not immediately apparent that there's an issue.
We've seen global, non-obvious problems before, namely race conditions. And by trying to handle race conditions, we've introduced so many locks that we've created another race condition: deadlock is itself a race condition!
Deadlock requires four conditions to hold at the same time: mutual exclusion, hold and wait (a thread holds one resource while it waits for another), no preemption of resources, and circular wait. If you can cross out any one of these 4 things, you won't have deadlock.
Priority Inversion is another global, non-obvious connection between locking and scheduling that can cause a host of problems.
Suppose you have three threads, L, M, and H, of low, medium, and high priority respectively, and that L and H both need the same resource. If L currently holds the lock and H wants it, H blocks until L is done. In an ideal world, H would get the lock the moment L releases it. However, suppose M becomes runnable while L is still holding the lock. Because M has higher priority than L, the scheduler runs M instead of L; L never gets to finish and release the lock, so H, the highest-priority thread, ends up waiting behind the medium-priority M indefinitely.
This problem actually occurred with JPL’s Mars Pathfinder in 1997. So how would we go about solving this issue?
Lock preemption could work, but it is too disruptive: it would boot processes out of the middle of updating fragile data structures and lose their work. Instead, the fix is priority inheritance: temporarily assign H's priority to L while L holds the lock, so that M doesn't step in (since H > M) and L can finish and release the lock to H.
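POSIX threads expose priority inheritance directly: a mutex can be configured so that whoever holds it is boosted to the priority of the highest-priority thread waiting for it. A minimal sketch of that configuration follows; it is my own example, not the actual Pathfinder fix, which flipped the equivalent option in VxWorks.

#include <pthread.h>

pthread_mutex_t resource_lock;

void init_resource_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* the holder inherits the priority of the highest-priority waiter */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&resource_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}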
Livelock occurs when threads are running (i.e. not blocked), but the system still isn't making progress. It's similar to deadlock, and to some extent it's even more insidious, because it's harder to debug: everything looks like it's getting work done.
The following illustrates an example of livelock:
Assume we have a bounded buffer which takes requests from the outside world at a high rate, but our system can only handle requests at a lower rate. As long as there isn’t too much network traffic, we’re fine because we can just queue requests.
Under heavy load, though, we won't be able to handle all of those requests. Processing a request requires real computation in addition to handling the interrupt that delivered it, and in the usual setup interrupts get higher priority, because you don't want to lose data. As traffic rises, more and more CPU time goes to interrupt handling, and performance stinks: you end up wasting the CPU stuffing requests into a buffer that you'll never get around to servicing. That's the livelock: the CPU is always busy, yet useful throughput drops toward zero.
Ideally, we’d like to spend as much of our CPU time doing valuable work. We can achieve this by masking interrupts once the buffer is full, and allowing interrupts again when the buffer is empty.
Locking code is a hassle – there’s a huge potential for bugs, and it can be extremely confusing. Is there any way to prevent race conditions without dealing with the pain of locking code?
As it turns out, there is: it's called event-driven programming.
The basic idea of event-driven programming is to first throw out the idea of threads. We still want to be able to deal with asynchronous events, but we won't use threads for this. Instead, we organize our computation into a series of events, so that our program is a set of event handlers. At runtime, the system maintains a priority queue of incoming events and runs the handler for each one. An event handler never waits; the goal is for each handler to finish quickly.
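A toy sketch of such a system follows; the event struct, the queue, and the function names are my own illustration of the idea, not any particular library's API.

#include <stdio.h>

typedef void (*handler_t)(void *arg);

struct event { int priority; handler_t run; void *arg; };

#define MAXEVENTS 128
static struct event queue[MAXEVENTS];
static int nevents = 0;

/* Insert keeping the array sorted by priority (lowest first),
   so the highest-priority event sits at the end. */
void post_event(struct event e)
{
    int i = nevents++;
    while (i > 0 && queue[i - 1].priority > e.priority) {
        queue[i] = queue[i - 1];
        i--;
    }
    queue[i] = e;
}

/* The whole program is this loop; each handler runs to completion
   and never blocks, so it is its own critical section. */
void event_loop(void)
{
    while (nevents > 0) {
        struct event e = queue[--nevents];
        e.run(e.arg);
    }
}

void on_request(void *arg) { printf("handling %s request\n", (char *)arg); }

int main(void)
{
    post_event((struct event){ .priority = 1, .run = on_request, .arg = "low" });
    post_event((struct event){ .priority = 9, .run = on_request, .arg = "high" });
    event_loop();   /* runs the "high" handler first, then "low" */
    return 0;
}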
The advantage of this approach is that each event handler is its own critical section: nothing can take over in the middle of a handler, so there are no races to lock against. Unfortunately, this approach doesn't scale to multiple cores, so we don't use it.
We’ll now switch gears and discuss file systems. File systems control how data is stored and retrieved.
One example of a file system is the IBM General Parallel File System (GPFS), a high-performance file system with 120 PB of data spread across 200,000 hard drives (each 600 GB, 15,000 RPM). The drives are relatively small for cost reasons and because more, smaller drives allow more parallelism.
Because we can’t fit all 200,000 hard drives, we’ll instead package a few into a group called a “node” and have a network of nodes. A network is a bunch of computers working together to provide the illusion of a single computer with this large file system.
A hard disk is cheaper per byte, but flash has no moving parts, so random access is much faster.
The problem with hard disks is that they are spinning magnetic platters: the read head has to seek to the right track and then wait for the right sector to rotate underneath it, so every random access pays seek latency plus rotational latency. (At 15,000 RPM, one revolution takes 4 ms, so the average rotational latency alone is about 2 ms.)
Let’s compare the specs of a particular hard disk vs. flash memory:
A downside of flash is that it wears out: each block survives only a limited number of writes. The common mitigation is a technique called wear leveling: the flash controller effectively "lies" about where it does the writes. You think you're writing to block x, but the controller moves the write to another physical block to spread out hot spots (blocks that would otherwise be written far more often than the rest).
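Here is a deliberately oversimplified sketch of the remapping idea; real flash translation layers also deal with erase blocks, garbage collection, and power loss, and program_block() and the other names here are hypothetical.

#define NBLOCKS 1024

extern void program_block(int pblock, const void *data);  /* low-level write */

static int logical_to_physical[NBLOCKS];  /* the controller's "lie" */
static int write_count[NBLOCKS];          /* wear on each physical block */

/* Redirect each write of logical block lblock to the least-worn
   physical block, then remember the mapping.  (This ignores the fact
   that the chosen block may already hold live data.) */
void flash_write(int lblock, const void *data)
{
    int best = 0;
    for (int p = 1; p < NBLOCKS; p++)
        if (write_count[p] < write_count[best])
            best = p;
    program_block(best, data);
    write_count[best]++;
    logical_to_physical[lblock] = best;
}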
Historically, flash drives have had a bad reputation for reliability because the wear-leveling code was often written by amateurs. So flash is fast, but you might lose all your work if you choose a drive with bad wear-leveling code.
At any given point in time, your OS will have a set of requests for each disk or flash drive. Your job as a scheduler is to decide which to do next.
Suppose we take the simplest scheduling algorithm we used for CPUs and apply it to the disk: first come, first served! Assume that we have an abstract device which acts like a disk (meaning seeks are required), and that the cost of serving a request is proportional to the distance the head has to travel.
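To make the model concrete, here is a small sketch that computes the total seek cost of serving requests first come, first served on such an abstract device, with head positions in [0, 1) and cost equal to distance traveled; the example request positions are my own.

#include <stdio.h>
#include <math.h>

/* Total cost of serving requests in arrival order, starting with the
   head at position `head`; each seek costs its distance. */
double fcfs_cost(const double *req, int n, double head)
{
    double total = 0;
    for (int i = 0; i < n; i++) {
        total += fabs(req[i] - head);
        head = req[i];
    }
    return total;
}

int main(void)
{
    /* head starts in the middle; requests alternate between the ends */
    double req[] = { 0.05, 0.95, 0.10, 0.90 };
    printf("FCFS cost: %g\n", fcfs_cost(req, 4, 0.5));
    /* 0.45 + 0.90 + 0.85 + 0.80 = 3.00: the head spends almost all of
       its time seeking back and forth across the disk. */
    return 0;
}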
Question for next time: Is first come first serve a good algorithm for disk drives?
(The answer is no.)