CS 111 Spring 2014
Lecture 10: File System Performance (With More on Deadlocks)
by Kevin Sheu and Zeeshan Pirzada
Table of Contents
- More on Deadlock
- Quasi-Deadlock
- Priority Inversion
- Receive Livelock
- Event Driven Programming
- Hardware Lock Elision (HLE)
- File System Performance Ideas
- Performance Metrics for File Systems
More on Deadlock
Deadlock has 4 conditions:
- Circular Wait - processes are waiting for resources held by other processes, which are waiting for other resources
- Mutual Exclusion - holding a resource excludes other processes from acquiring that resource
- No Preemption - a process cannot take a resource away from the process that currently holds it
- Hold & Wait - processes are holding one resource while waiting for another
Solution: In order to avoid deadlock, you need to design a system in which at least one of these conditions cannot hold.
- Circular wait is the condition most commonly attacked in system design.
To avoid deadlocking, we can look for cycles in the wait-for graph on each wait operation: if letting a process block on a resource would create a cycle, we refuse to let it acquire that resource.
The cost of checking for this type of deadlock with a loop over P processes and R resources is O(PR) per wait. This is a naive implementation, and we would obviously prefer a solution with a significantly lower cost, ideally O(1).
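As a concrete illustration, here is a minimal sketch of the cycle check in C, assuming a simplified model in which each process waits on at most one resource at a time; waits_for and would_deadlock are illustrative names, not part of any real API.

    #include <stdbool.h>

    #define NPROC 64
    extern int waits_for[NPROC];   /* waits_for[p] = process holding what p wants, or -1 */

    /* Before letting process p block on a resource owned by 'owner', walk the
       chain of waiters starting at the owner.  If the chain leads back to p,
       granting the wait would close a cycle, so the request must be refused. */
    bool would_deadlock(int p, int owner) {
        for (int q = owner; q != -1; q = waits_for[q])
            if (q == p)
                return true;
        return false;
    }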
Another solution is lock ordering:
- Processes would lock resources in increasing order (e.g. by machine address)
- Every resource would be placed in this ordering, and the ordering would be the same for every process.
- If a process finds that a lock it needs is busy, it gives up all of the locks it holds and starts over (see the sketch after this list)
- This eliminates the deadlock condition of hold and wait.
- O(1)!
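Here is a minimal sketch of lock ordering in C using POSIX mutexes; lock_both is an illustrative helper, and ordering by machine address stands in for whatever global ordering the system chooses.

    #include <pthread.h>
    #include <sched.h>

    /* Acquire two locks in increasing address order.  If the second one is
       busy, release the first and start over instead of holding and waiting. */
    void lock_both(pthread_mutex_t *a, pthread_mutex_t *b) {
        if (a > b) { pthread_mutex_t *t = a; a = b; b = t; }  /* order by address */
        for (;;) {
            pthread_mutex_lock(a);                 /* lower-address lock first */
            if (pthread_mutex_trylock(b) == 0)
                return;                            /* got both locks, in order */
            pthread_mutex_unlock(a);               /* b was busy: give everything up */
            sched_yield();                         /* let someone else make progress */
        }
    }

Because every thread acquires locks in the same order and gives everything up when it hits a busy lock, hold-and-wait (and hence circular wait) cannot arise.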
Quasi-Deadlock
Priority Inversion
- Problem: a high-priority thread blocks on a lock held by a low-priority thread; if a medium-priority thread keeps the CPU busy, the low-priority thread never runs and the high-priority thread waits indefinitely, even though nothing is technically deadlocked.
- Solution: when a high-priority thread waits for a resource, it lends its priority to whichever thread currently holds that resource.
- Example: T-low runs at high priority until it releases the lock, then drops back to its low priority (see the sketch below).
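A minimal sketch of priority lending on a POSIX system that supports priority-inheritance mutexes; this is one standard way to get the behavior described above, not necessarily the exact mechanism the lecture had in mind.

    #include <pthread.h>

    pthread_mutex_t lock;

    /* Create the lock with the priority-inheritance protocol: if T-high blocks
       on it while T-low holds it, T-low is temporarily boosted to T-high's
       priority until it unlocks. */
    void init_pi_lock(void) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }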
Receive Livelock
A graph of throughput versus incoming request rate shows the typical trends: ideally, throughput climbs and then levels off, but with receive livelock it climbs, peaks, and then collapses toward zero as the request rate keeps growing.
Here is the scenario that explains that awful collapsing curve:
Problem: The CPU is doing two things:
- The interrupt service routine grabs data and puts the data into the bounded buffer.
- The work routine takes data out of the buffer and does the actual computation.
However, when too many requests come in, the CPU becomes swamped: the interrupt service routine has higher priority than the work routine, so the buffer keeps filling but no useful work gets done.
Solution:
- Check how full the buffer is when the work routine runs.
- If it is above a high-water mark (e.g. ~90% full), disable and ignore the interrupt service routine.
- Once the buffer falls back to a low-water mark (e.g. ~50% full), accept interrupts again; this keeps the throughput curve from collapsing (a sketch follows).
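A minimal sketch of the high/low watermark idea; BUF_SIZE, buffered, and the enable/disable routines are hypothetical names for this example.

    #define BUF_SIZE 1024          /* capacity of the bounded buffer (assumed) */

    extern int buffered;           /* entries currently waiting in the buffer */
    extern void disable_rx_interrupts(void);   /* hypothetical device controls */
    extern void enable_rx_interrupts(void);
    extern void process_one_request(void);

    void work_routine(void) {
        if (buffered >= BUF_SIZE * 9 / 10)     /* above ~90%: stop taking interrupts */
            disable_rx_interrupts();

        process_one_request();                 /* do the actual work */
        buffered--;

        if (buffered <= BUF_SIZE / 2)          /* back near 50%: accept interrupts again */
            enable_rx_interrupts();
    }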
Event-Driven Programming
Data Races are Evil:
- Deadlocks occur because resources exist that need to be held and locked.
- The problems and solutions exist because of data race conditions
- Essentially, the underlying problem is that we have more than one thread running.
Alternative Solution: just use one thread
- However, we still want to be able to wait for events.
- This approach is called event driven programming.
Example:
    while (true) {
        wait_for_event(e);
        e->handler(e);
    }
- Key Idea: the handlers themselves cannot wait; the only part of the system that waits is wait_for_event(e).
- Fundamental Property: each handler must finish quickly, so that no event has to wait long behind the others. (A more concrete sketch appears after the lists below.)
Positive (+)
- No locks
- No deadlock
- No bugs (that arise from deadlocking)
Negative (-)
- Handlers are restricted (we may be forced to break handlers into pieces)
- Cannot scale via multithreading, so we are forced to scale via multiprocess or multisystem
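To make this concrete, here is a minimal sketch of a single-threaded event loop built on poll(); the file descriptors and handler names are made up for illustration.

    #include <poll.h>

    #define NFDS 2

    typedef void (*handler_t)(int fd);

    /* Handlers must run to completion without waiting. */
    static void handle_network(int fd) { /* read a request, queue a response */ }
    static void handle_disk(int fd)    { /* consume a completed disk I/O */ }

    void event_loop(int net_fd, int disk_fd) {
        struct pollfd fds[NFDS] = {
            { .fd = net_fd,  .events = POLLIN },
            { .fd = disk_fd, .events = POLLIN },
        };
        handler_t handlers[NFDS] = { handle_network, handle_disk };

        for (;;) {
            poll(fds, NFDS, -1);                 /* the only place we ever wait */
            for (int i = 0; i < NFDS; i++)
                if (fds[i].revents & POLLIN)
                    handlers[i](fds[i].fd);      /* dispatch; handlers never block */
        }
    }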
Hardware Lock Elision
- One major problem with spin locking is that too much CPU time is spent waiting for the locks to lock/unlock.
- Solution: Get hardware support that gives the effect of spin locks without actually serializing on the lock when there is no contention.
Example Code:
    lock:
        movl $1, %eax
    try:
        xacquire lock xchgl %eax, lockvar
        cmp $0, %eax
        jnz try
        ret
    unlock:
        xrelease movl $0, lockvar
        ret
- The xacquire prefix allows optimistic execution of the critical section by eliding the lock: the lock still appears free to other threads, so they can run the critical section concurrently.
- The xrelease prefix marks the end of the elided critical section. If another thread conflicted with us while the lock was elided, the hardware discards the speculative work and re-executes the critical section, this time actually acquiring the lock. (A C usage sketch follows.)
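A minimal sketch of how the routines above might be used from C; balance and deposit are illustrative, and lock, unlock, and lockvar refer to the assembly above.

    extern void lock(void);     /* the xacquire-based routine above */
    extern void unlock(void);   /* the xrelease-based routine above */

    long balance;               /* shared data that lockvar is meant to protect */

    void deposit(long amount) {
        lock();                 /* usually elided: threads can run this concurrently */
        balance += amount;      /* critical section */
        unlock();               /* on a conflict, hardware re-runs it with the real lock */
    }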
File System Performance Ideas
Below are some ideas that can be used to enhance a file system's performance:
Striping
Assign a piece of the file to each disk, so the pieces can be read in parallel
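A minimal sketch of the block-to-disk mapping, assuming fixed-size blocks laid out round-robin across the disks; NDISKS and the stripe function are illustrative.

    #define NDISKS 4            /* number of disks in the stripe set (assumed) */

    /* Map a file's logical block number to (disk, block within that disk).
       Consecutive logical blocks land on different disks, so up to NDISKS
       blocks can be read in parallel. */
    void stripe(long logical_block, int *disk, long *physical_block) {
        *disk           = (int)(logical_block % NDISKS);
        *physical_block = logical_block / NDISKS;
    }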
Distributed Metadata
- metadata = info about file location, ownership, timestamp, etc.
- Every node will contain metadata about the entire file system
- Each node effectively becomes a master node for the system
Efficient Directory Indexing
- If a directory contains millions of files (say, over 10,000,000), it needs an efficient internal data structure, such as a B+ tree, rather than a linear list of entries.
Partition Awareness
- Ideally, the file system is spread over the network (the cloud).
- Occasionally the network will partition (for example, a link fails), splitting the nodes into disconnected groups.
- Clients that can only reach the smaller side get read-only access.
- Clients on the larger side can both read and write.
- Basically, the largest partition stays live.
File System Staying Live During Maintenance
- Upgrading, adding files, removing files, and so on should not affect whether the system keeps functioning.
Performance Metrics For File Systems
Performance metrics:
- Latency - delay from request to response
- Throughput - rate of request completions
- Utilization - fraction of capacity doing useful work (e.g. CPU cycles, bus bandwidth)
Example: Assume a system with these characteristics:
- CPU cycle time = 1 nanosecond
- Programmed Input/Output (PIO) = 1000 cycles = 1 microsecond
- Sending a command to the SSD (solid-state drive) = 5 PIOs = 5 microseconds
- SSD Latency = 50 microseconds
- Computation per buffer = 5 microseconds
- Reading 40 bytes from SSD = 40 PIOs = 40 microseconds
- Blocking overhead = 50 microseconds
Basic busy-wait + polling approach for reading:
    for (;;) {
        char buf[40];
        read 40 bytes from SSD into buf;   // poll (busy-wait) until the device is ready
        compute(buf);
    }
Latency (microseconds):
- Sending command to SSD: 5
- SSD latency: 50
- Read latency: 40
- Computation: 5
- Total: 100
- Throughput = 1/latency = 1/(100 microseconds) = 10,000 requests/second
- Utilization = 5/100 = 5% (only the 5 microseconds of computation is useful work)
- This is okay, but we can definitely go faster! (A small sketch of this arithmetic follows.)
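As a sanity check, here is a small C program that reproduces the numbers above from the cost assumptions listed earlier.

    #include <stdio.h>

    int main(void) {
        /* cost assumptions from the list above, all in microseconds */
        double send_cmd = 5, ssd_latency = 50, read_40 = 40, compute = 5;

        double latency     = send_cmd + ssd_latency + read_40 + compute;  /* 100 us */
        double throughput  = 1e6 / latency;       /* requests per second */
        double utilization = compute / latency;   /* fraction of time doing useful work */

        printf("latency = %.0f us, throughput = %.0f req/s, utilization = %.1f%%\n",
               latency, throughput, 100 * utilization);
        return 0;
    }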
Below are different ideas for speedup:
Batching
Basic Idea: grabbing bigger blocks of data (bigger buffer)
    for (;;) {
        char buf[840];
        read 840 bytes from SSD into buf;
        for (i = 0; i < 21; i++)
            compute(buf + i*40);
    }
Latency (microseconds):
- Sending command to SSD: 5
- SSD latency: 50
- Read latency: 21 * 40 = 840
- Computation: 21 * 5 = 105
- Total: 1000
- Throughput = 21 requests / 1000 microseconds = 21,000 req/sec
- Utilization = 105/1000 = 10.5%
Interrupts
    for (;;) {
        send command to SSD;
        do {
            block until interrupt;    // give up the CPU while the SSD works
            handle interrupt;
        } while (!ready);
        read 40 bytes from SSD into buf;
        compute(buf);
    }
Latency (microseconds):
- Send to SSD: 5
- Blocking: 50
- Handling: 5
- Checking if ready: 1
- Read latency: 40
- Computation: 5
- Total: 106
- Throughput = 1/(56 microseconds) ≈ 17,857 req/sec (the 50 microseconds of blocking overlap with other work, so each request costs only 5 + 5 + 1 + 40 + 5 = 56 microseconds of CPU time)
- Utilization = 5/56 ≈ 8.9%
DMA (Direct Memory Access)
    while (1) {
        write command to disk;
        block until interrupt;
        check if disk is ready;
        compute;
    }
Latency (microseconds):
- Blocking: 50
- Handling: 5
- Checking if ready: 1
- Computation: 5
- Total: 61
- Throughput = 1/(11 microseconds) ≈ 91,000 req/sec (the blocking overlaps with other work, so each request costs 5 + 1 + 5 = 11 microseconds of CPU time)
- Utilization = 5/11 ≈ 45%
However, we can give up on interrupts entirely and just use a polling approach.
DMA + Polling
    while (1) {
        write command to disk;
        while (DMA slots are not ready)
            schedule();          // yield to another thread instead of blocking
        check if disk is ready;
        compute;
    }
Latency (microseconds):
- Blocking (scheduling away): 50
- Checking if ready: 1
- Computation: 5
- Total: 56
- Throughput = 1/(6 microseconds) ≈ 166,666 req/sec
- Utilization = 5/6 ≈ 84%
We can see that the utilization rate is extremely high for DMA with polling. Consequently, this is one of the more popular methods!