Table of Contents

Deadlocks and File System Performance
Recursive Livelock
Hardware Lock Elision (HLE)
File System Performance
Performance Metrics

CS 111 Lecture 10 Scribe Notes 
by Gagik Matevosyan and Rossen Chemelekov


Deadlocks and File System Performance:

Overview: Last lecture introduced synchronization and locking. This lecture wraps up that discussion by examining deadlock and how to prevent it, then priority inversion and livelock; it considers event-driven programming and hardware lock elision as alternatives to locking, and closes with file system performance.
More on deadlocks:
A deadlock requires four conditions to hold simultaneously:

  1. Circular wait: a process holding a resource is waiting on a resource held by another process that is (directly or transitively) waiting on the first process's resource.
  2. Mutual exclusion: a resource can be held by at most one process at a time.
  3. No preemption of locks: a lock on a resource can be released only by the locking process; the OS cannot take it away.
  4. Hold & wait: a process holds a resource while waiting for resource(s) held by another process.
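
As a minimal illustration in C: two threads that take the same two pthread mutexes in opposite orders can each end up holding one lock while waiting forever for the other, satisfying all four conditions.

        #include <pthread.h>

        static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

        static void *thread_a(void *arg) {
                pthread_mutex_lock(&m1);    /* hold m1 ... */
                pthread_mutex_lock(&m2);    /* ... and wait for m2 */
                pthread_mutex_unlock(&m2);
                pthread_mutex_unlock(&m1);
                return arg;
        }

        static void *thread_b(void *arg) {
                pthread_mutex_lock(&m2);    /* hold m2 ... */
                pthread_mutex_lock(&m1);    /* ... and wait for m1: circular wait */
                pthread_mutex_unlock(&m1);
                pthread_mutex_unlock(&m2);
                return arg;
        }

        int main(void) {
                pthread_t a, b;
                pthread_create(&a, 0, thread_a, 0);
                pthread_create(&b, 0, thread_b, 0);
                pthread_join(a, 0);         /* may never return */
                pthread_join(b, 0);
                return 0;
        }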

Circular wait is the condition most commonly attacked in system design. One way to do this is to have
the resource manager look for cycles in the wait-for graph on each wait.
[Figure: a cycle in the wait-for graph]

What is the cost? Given P processes and R resources, checking for a cycle on each wait costs O(PR) in the worst case.
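
As a sketch, under the simplifying assumption that each process waits on at most one resource with a single owner, the check reduces to following the chain of waiters:

        /* Sketch: before letting process p block on a lock whose owner is q,
           follow the chain of "waiting on" edges from q; if the chain leads
           back to p, granting the wait would close a cycle, i.e. a deadlock. */
        #define NPROC 64

        int waiting_on[NPROC];    /* whom each process waits on, or -1 */

        int would_deadlock(int p, int q) {
                for (int cur = q; cur != -1; cur = waiting_on[cur])
                        if (cur == p)
                                return 1;    /* cycle found: refuse the wait */
                return 0;
        }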

This can be improved to O(1) by lock ordering:
Assumption: cooperative applications.
i) Processes acquire locks in increasing order (of machine address).
ii) If you find a lock busy, give up all the locks you hold and start over.
[Figure: acquiring locks in address order]


* With i) & ii), deadlock cannot occur because condition 4 (hold & wait) never happens: a process never waits while holding a lock.
Problem: lock ordering relies on your applications being cooperative!
Solution: the OS can enforce the ordering. Downside: larger overhead.
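
A sketch of both rules with pthread mutexes (lock_pair is our own illustrative helper, not a standard API):

        #include <pthread.h>
        #include <sched.h>

        /* Acquire two locks deadlock-free: i) order by machine address,
           ii) if the second lock is busy, give everything up and retry. */
        static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
                pthread_mutex_t *lo = a < b ? a : b;    /* lower address first */
                pthread_mutex_t *hi = a < b ? b : a;
                for (;;) {
                        pthread_mutex_lock(lo);
                        if (pthread_mutex_trylock(hi) == 0)
                                return;                 /* got both: no hold & wait */
                        pthread_mutex_unlock(lo);       /* ii) give up and start over */
                        sched_yield();
                }
        }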

(Quasi-)deadlock: Priority Inversion (famously caused the Mars Pathfinder lander to reset repeatedly in 1997)
Assumption: A preemptive scheduler with priorities.
[Figure: priority inversion scenario]

Solutions to Priority Inversion: (Eggert: "Don't allow 3 levels of priority. Duh.")
- Priority inheritance: a low-priority process holding a lock required by a high-priority process is temporarily
given the priority of the highest-priority process waiting for the lock, so a medium-priority process cannot
preempt it; when it releases the lock, it returns to its low priority.
[Figure: priority inheritance resolving the inversion]
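
POSIX exposes exactly this scheme as the priority-inheritance mutex protocol; a minimal sketch:

        #include <pthread.h>

        pthread_mutex_t m;

        void init_pi_mutex(void) {
                pthread_mutexattr_t attr;
                pthread_mutexattr_init(&attr);
                /* the holder of m is boosted to the priority of its
                   highest-priority waiter until it unlocks */
                pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
                pthread_mutex_init(&m, &attr);
        }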

Recursive Livelock: 
[Figure: recursive livelock: interrupt handling keeps preempting the work routine]


High-priority tasks (such as interrupt handlers) swamp the CPU, and low-priority tasks take a long time to finish (or never do).
[Figure: the request buffer fills faster than the work routine can drain it]

Solution to recursive livelock: the work routine can check how full the buffer is. If the buffer is full, disable
interrupts temporarily, allowing the CPU to drain the buffered work before taking new requests.
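
A sketch of that fix, where struct request, process(), disable_interrupts(), and enable_interrupts() are hypothetical stand-ins for the kernel's own primitives:

        struct request { int data; };      /* hypothetical request type */
        void process(struct request r);    /* hypothetical */
        void disable_interrupts(void);     /* hypothetical */
        void enable_interrupts(void);      /* hypothetical */

        #define BUFSIZE 128

        static struct request buf[BUFSIZE];
        static int head, tail, count;

        void interrupt_handler(struct request r) {    /* high priority */
                buf[tail] = r;
                tail = (tail + 1) % BUFSIZE;
                if (++count == BUFSIZE)
                        disable_interrupts();    /* stop taking new requests */
        }

        void work_routine(void) {                     /* low priority */
                while (count > 0) {
                        process(buf[head]);
                        head = (head + 1) % BUFSIZE;
                        count--;
                }
                enable_interrupts();    /* buffer drained: accept requests again */
        }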

[Obvious Conclusion] Data races are evil! If not for data races, we wouldn't need locks, mutexes, etc. Data-race problems arise because we have 2 or more threads.
Alternative: use just one thread. Of course, we still want to be able to wait for events.
Event-driven Programming: single thread only.

        while (true) {
               wait(e);          // wait for some event e to happen
               e->handler(e);    // and handle it
        }
* Handlers themselves never wait and never acquire or release locks. They must finish fast; otherwise they stall the whole program.
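
A self-contained sketch of the same loop in C, where struct event and wait_for_event() are hypothetical stand-ins for whatever event source the system provides:

        struct event {
                void (*handler)(struct event *);    /* each event carries its handler */
        };

        struct event *wait_for_event(void);    /* blocks until some event arrives */

        void event_loop(void) {
                for (;;) {
                        struct event *e = wait_for_event();    /* the only place we wait */
                        e->handler(e);    /* must finish fast and never block */
                }
        }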

       Pros:
       - no locks
       - no deadlocks
       - no bugs due to forgotten locking

       Cons:
       - handlers are restricted: bigger tasks must be broken into smaller pieces
       - can't scale via multithreading (SMP) + shared memory
Problem: too much time is spent waiting for locks.
Solution: get hardware support. Hardware Lock Elision: pretend to take the lock and act accordingly.


Hardware Lock Elision (HLE):
Example (Haswell instruction set):

        lock:
               movl $1, %eax
        try:
               xacquire lock xchgl %eax, lock    // atomically trade register with memory
               cmpl $0, %eax                     // if 0: lock was free and is now ours
               jnz try                           // otherwise it was already held; try again
               ret                               // lock acquired: return
        unlock:
               xrelease movl $0, lock            // release the lock
               ret
* The xacquire and xrelease prefixes are encoded so that older CPUs simply ignore them; the same code still works (as an ordinary spinlock) on older systems.
How can the CPU make it look as if the elided instructions never happened? Through the cache: speculative writes are held in the cache and discarded if the transaction aborts.
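
GCC exposes these prefixes from C through its atomic builtins; a sketch, assuming GCC 4.8 or later compiled with -mhle on a Haswell-class CPU:

        static int lock_var = 0;

        void hle_lock(void) {
                /* emits xacquire xchg: begins lock elision */
                while (__atomic_exchange_n(&lock_var, 1,
                                           __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
                        while (lock_var)
                                continue;    /* spin read-only until the lock looks free */
        }

        void hle_unlock(void) {
                /* emits xrelease mov: ends elision; commits if no conflict occurred */
                __atomic_store_n(&lock_var, 0,
                                 __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
        }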

If you wish to know more, look into Transactional Memory. 


File System Performance:
Example File System:

General Parallel File System (GPFS)
Performance ideas:
1. Striping: split a file into multiple pieces and put them on different disks, so all the pieces can be read simultaneously (see the striping sketch after this list).
[Figure: a file striped across several disks]

              
2. Distributed metadata: metadata describes a file's location, ownership, timestamps, etc.; it is kept in a cluster of nodes rather than in a single location. In the cluster, every node is a data node and potentially a master node.
[Figure: metadata distributed across a cluster of nodes]

3. Efficient directory indexing: large directories are indexed (GPFS uses extensible hashing) so lookups stay fast even with millions of entries.
[Figure: directory index structure]


4. Partition awareness: if the network splits the cluster, the largest partition stays alive, preventing the two sides from making conflicting updates.
[Figure: network partition; the largest partition stays live]


5. File system stays alive during maintenance
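
A sketch of the striping arithmetic from idea 1, with illustrative STRIPE_SIZE and NDISKS values and simple round-robin placement:

        #define STRIPE_SIZE 65536    /* bytes per stripe unit (illustrative) */
        #define NDISKS      8        /* number of disks (illustrative) */

        struct location { int disk; long offset; };

        /* Map a byte offset in the file to a disk and an offset on that disk. */
        struct location locate(long file_offset) {
                long stripe = file_offset / STRIPE_SIZE;    /* which stripe unit */
                struct location loc;
                loc.disk   = stripe % NDISKS;               /* round-robin across disks */
                loc.offset = (stripe / NDISKS) * STRIPE_SIZE
                           + file_offset % STRIPE_SIZE;
                return loc;
        }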


How do we measure File System Performance?
Performance Metrics:

As previously done, we want to minimize latency and maximize throughput and utilization.

Example with a file system having the following characteristics:

        5 microseconds to send a command to the SSD
        50 microseconds of SSD latency per request
        1 microsecond per byte to read data (40 microseconds for a 40-byte buffer)
        5 microseconds to compute on each 40-byte buffer

Using the same algorithm as we did with booting:
Busy wait + polling:

        for (;;) {
               char buf[40];
               read(fd, buf, 40);    // read 40 bytes from SSD into buf
               compute(buf);
        }
For each buffer:

Latency (microseconds) = (send command to SSD) + (SSD latency) + (read latency) + (compute) =
                       = 5 + 50 + 40 + 5 = 100 microseconds
Throughput = 1/latency = 1/(100 microseconds) = 10,000 requests/second
Utilization = (compute time)/(total time) = 5/100 = 5%
Ideas for speedup:
Batching - read a big block rather than a few bytes. The first answer arrives later
because the batch is bigger, but overall throughput and utilization go up.

changed code:
        for (;;) {
               char buf[840];
               read(fd, buf, 840);           // one big read: 21 buffers at once
               for (int i = 0; i < 21; i++)
                       compute(buf + i*40);  // process each 40-byte piece
        }
Latency (microseconds) = (send command to SSD) + (SSD latency) + (read latency) + (compute) =
                       = 5 + 50 + 840 + 105 = 1000 microseconds for a batch of 21 buffers
Throughput = 21/(1000 microseconds) = 21,000 requests/second
Utilization = (compute time)/(total time) = 105/1000 = 10.5%

Interrupts:

        for (;;) {
               block until there's an interrupt;    // CPU can run other work meanwhile
               do
                       handle the interrupt;
               while (!ready);
        }
Latency (microseconds) = (send) + (block until interrupt) + (handle interrupt) + (check ready) + (read latency) + (compute) =
                       = 5 + 50 + 5 + 1 + 40 + 5 = 106 microseconds
While one request is blocked for the 50-microsecond SSD latency, the CPU can service other requests, so throughput is governed by the 56 microseconds of CPU time per request:
Throughput = 1/(56 microseconds) = 17,857 requests/second
Utilization = (compute time)/(CPU time per request) = 5/56 = 8.9%

Direct Memory Access (DMA):
        while (1) {
               write command to disk;
               block until interrupt;
               check that the disk is ready;
               compute;
        }
Latency (microseconds) = (block until interrupt) + (handle interrupt) + (check ready) + (compute) =
                       = 50 + 5 + 1 + 5 = 61 microseconds
The DMA hardware does the transfer, so the CPU spends only 5 + 1 + 5 = 11 microseconds per request:
Throughput = 1/(11 microseconds) = 91,000 requests/second
Utilization = (compute time)/(CPU time per request) = 5/11 = 45%

- DMA + Polling: rather than blocking until an interrupt arrives, yield the CPU so someone else can do useful work, then poll:

        while (DMA slots are not ready)
               schedule();    // yield the CPU to other useful work
Latency (microseconds) = (waiting) + (check disk) + (compute) =
                       = 50 + 1 + 5 = 56 microseconds
Polling eliminates the interrupt-handling overhead, leaving only 1 + 5 = 6 microseconds of CPU time per request:
Throughput = 1/(6 microseconds) = 166,667 requests/second
Utilization = (compute time)/(CPU time per request) = 5/6 = 83%