Authors: Brandon Downs, Yunzhong He, Kyoseong Ku, Zhuo Li
A file system is a data structure that lives partly in memory and partly on secondary storage devices. When designing a data structure that is spread across such different places, we ideally want the file system to perform well.
How do we measure "good"? How can we tell if a file system is performing well?
The way we measure a file system's performance is by looking at its latency/response time, throughput, utilization, and capacity.
Latency is the time between issuing a request and receiving the response, which we would like to minimize.
Throughput is the rate of request completion, i.e., how many requests are satisfied per second. Usually throughput measures something meaningful, such as the number of I/O operations satisfied per second.
Utilization is the fraction of time that the underlying system is doing useful work, which is related to throughput. Utilization as close to 100% as possible is desired.
Capacity is how much data the file system can hold. We can make a file system hold more than its physical capacity by using compression; for example, we can fit 2TB of data on a 1TB disk.
Latency and throughput are the most important, and there are two other, lower-level topics to consider: fragmentation and the transfer rate from one part of a file system to another. Reducing fragmentation is ideal, but it is considered low level because how the files are scattered across the disk is irrelevant to the user; what the user cares about is that the files exist and are accessible. Similarly, the transfer rate from one part of a file system to another is low level because what the user cares about is how much data is arriving at the user level.
Low latency and high throughput are desired.
Assume a device with the following characteristics: sending a command takes 5 microsec, the device needs 50 microsec to produce the data, reading 40 bytes takes 40 microsec, and computing on one 40-byte buffer takes 5 microsec.
for (;;) {
    char buf[40];
    read 40 bytes into buf;
    compute(buf);
}
Latency = send command (5 microsec) + device (50 microsec) + read (40 microsec) + compute (5 microsec) = 100 microsec for each loop
Throughput = 1/latency = 10,000 requests per second
CPU utilization = (5 microsec compute operations) / (100 microsec for each loop) = 0.05 or 5%
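To make the arithmetic easy to check, here is a minimal C sketch (not from the lecture; the constants are just the per-operation costs assumed above) that recomputes the naive loop's metrics:

#include <stdio.h>

int main(void) {
    /* Assumed costs, in microseconds, for one 40-byte request. */
    double send = 5, device = 50, read = 40, compute = 5;
    double latency = send + device + read + compute;   /* 100 microsec per iteration */
    double throughput = 1e6 / latency;                 /* requests per second */
    double utilization = compute / latency;            /* fraction of CPU time doing useful work */
    printf("latency = %g microsec, throughput = %g req/s, CPU utilization = %g%%\n",
           latency, throughput, 100 * utilization);
    return 0;
}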
Improve performance by increasing the buffer size and batching.
Instead of reading only 40 bytes at a time, read many buffers' worth at once:
enum { BATCHSIZE = 21 };   // read 21 buffers at a time
for (;;) {
    char buf[40 * BATCHSIZE];
    read 40*BATCHSIZE bytes into buf;
    for (int i = 0; i < BATCHSIZE; i++)
        compute(buf + 40 * i);
}
Latency = send command (5 microsec) + device (50 microsec) + read (40*21 = 840 microsec) + compute (5*21 = 105 microsec) = 1000 microsec for the whole batch
The 1000 microsec covers the entire batch, so the amortized per-buffer cost (about 48 microsec) is less than the naive 100 microsec, even though an individual request may now wait up to 1000 microsec.
For the batched version, throughput is not simply 1/latency. Each request is now 21 buffers' worth, and we complete 1000 such batch requests per second; counted in the original 40-byte requests, that is 21,000 req/s.
CPU utilization = 105/1000 = 0.105, or 10.5%
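To see how the batch size trades latency against utilization, here is a hedged sketch that generalizes the same arithmetic to an arbitrary batch size n (again using the assumed per-operation costs, not measured values):

#include <stdio.h>

int main(void) {
    /* Assumed costs, in microseconds. */
    double send = 5, device = 50, per_read = 40, per_compute = 5;
    for (int n = 1; n <= 21; n += 5) {
        double batch_latency = send + device + per_read * n + per_compute * n;
        double throughput = n * 1e6 / batch_latency;      /* 40-byte requests per second */
        double utilization = per_compute * n / batch_latency;
        printf("n=%2d: batch latency=%6g microsec, throughput=%8g req/s, CPU util=%5.1f%%\n",
               n, batch_latency, throughput, 100 * utilization);
    }
    return 0;
}

As n grows, the batch latency grows linearly while utilization approaches per_compute/(per_read + per_compute) = 5/45 ≈ 11%, so batching alone cannot push CPU utilization much higher than the 10.5% computed above.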
According to the computations above, overall performance improves if we look at throughput and CPU utilization, but the downside is that latency goes up. Can we do better than batching? Yes, by using interrupts.
The old way:
for (;;) {
    write a command to the device;
    while (!ready)
        continue;
    read data from the device;
    compute();
}
The while loop is a bottleneck: while sitting around waiting, we could be getting useful work done. Since we cannot compute on data that has not been read yet, we compute on a previously read buffer while we wait for the next I/O to complete.
Replace the while statement above with this
do {
    wait for interrupt;   // the CPU just stalls here
    handle interrupt;
} while (!ready);
Latency = send command (5 microsec) + wait for interrupt block (50 microsec) + overhead of handling interrupt (5 microsec) + check ready (1 microsec) + read (40 microsec) + compute (5 microsec) = 106 microsec
Throughput = 1/(56 microsec) ≈ 17,857 req/s. The 50 microsec spent blocked does not limit throughput, because the CPU can work on other requests during that time; each request costs only 106 − 50 = 56 microsec of CPU time.
CPU utilization = 5/56 = 8.9%
In order to further improve the performance, we can also combine interrupts and batching.
DMA (direct memory access): the device controller sends data to memory directly, without going through the CPU.
for (;;) {
    issue a write command to the device;
    block until interrupt;
    handle interrupt;
    check that the device is ready;
    compute;
}
Latency = block (50 microsec) + interrupt handler (5 microsec) + check (1 microsec) + compute (5 microsec) = 61 microsec
Throughput = 1/(11 microsec) ≈ 91,000 req/s, since each request costs the CPU only 11 microsec (handle interrupt + check + compute); the 50 microsec block overlaps with work on other requests.
CPU utilization = 5/11 = 45%
Now, we cannot compute any faster, so to improve performance further we need to get rid of the interrupt handler's overhead.
DMA + polling: run the program flat out, as fast as possible.
while (DMA slots not ready)
    schedule();
Insert these two lines in place of "block until interrupt" in the DMA loop above. Calling schedule() gives the CPU to another process so that it can do something useful, and it avoids the overhead of the interrupt handler. The idea of DMA + polling is to keep the CPU busy doing useful work; if this process has no useful work, hand control to other processes. The calculations below assume schedule() costs nothing; even if we charge it about 2 microsec per request (roughly 8 microsec of CPU time per request, about 125,000 req/s, about 62% utilization), DMA + polling still comes out ahead.
Latency = block (50 microsec) + check (1 microsec) + compute (5 microsec) = 56 microsec
Throughput = 1/(6 microsec) ≈ 166,667 req/s, since each request costs the CPU only 6 microsec (check + compute)
CPU utilization = 5/6 ≈ 84%
| Method | Latency (microsec) | Throughput (thousands of req/s, i.e. 40 KB/s per unit) | CPU utilization (%) |
|---|---|---|---|
| Naive | 100 | 10 | 5 |
| Batching | ~1000 | 21 | 10.5 |
| Interrupts | 106 | ~18 | 8.9 |
| DMA + interrupts | 61 | 91 | 45 |
| DMA + polling | 56 | 167 | 84 |
If polling is used, we assume that all processes are well-behaved.
Linux-like systems use interrupts because we can't assume that all processes are well-behaved.
The two main devices used for I/O are disk and flash.
For a spinning disk, the most bothersome figures are the 4.16 ms average rotational latency and the 8.5 ms average read seek time. Can we worry only about the 8.5 ms, on the theory that the head can move to the right track while we wait for the right sector to rotate under it? No: it is hard to overlap the two operations, because we do not know far enough in advance where we will be reading from, so the two times must be added, giving roughly 8.5 + 4.16 ≈ 13 ms.
Prefetching is getting data from the device (into a RAM cache) before the application requests it: if the application is reading sequentially, assume subsequent sectors will also be read and move them into the cache ahead of time. This can be done in the application (similar to batching), in the OS, or in the disk controller. Prefetching can hurt performance by wasting RAM and tying up the disk doing the prefetch, only to find out the data wasn't needed.
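If prefetching is done at the application level on a POSIX system, one way to sketch it (an illustration, not the only way; the file name and chunk size are made up) is to hint the OS's RAM cache with posix_fadvise while computing on the chunk just read:

#define _XOPEN_SOURCE 600   /* for pread() and posix_fadvise() */
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);   /* hypothetical input file */
    if (fd < 0)
        return 1;

    char buf[4096];
    off_t offset = 0;
    ssize_t n;
    while ((n = pread(fd, buf, sizeof buf, offset)) > 0) {
        /* Ask the OS to start pulling the next chunk into its cache
           while we are still computing on the current one. */
        posix_fadvise(fd, offset + n, (off_t)sizeof buf, POSIX_FADV_WILLNEED);
        /* compute(buf, n); */
        offset += n;
    }
    close(fd);
    return 0;
}

If the guess is wrong and the next chunk is never read, this wastes cache RAM and disk time, which is exactly the risk described above.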
Speculation is a generalization of prefetching; in other words, guess what future requests will look like and do extra work now to satisfy them later. Speculation can sometimes hurt performance.
Batching (for reads) is a common use of prefetching.
Batching (for writes): cache the data you want to write in the kernel, but do not actually store it on disk yet; wait until the buffer fills up before writing it to disk. The downside is that the data is not actually written yet, so it is not in persistent storage.
Dallying is waiting for a similar request: do not write immediately, wait for a dallying interval of 10 ms or so in the hope of receiving another write command, and then execute both together. The downside, again, is that the data is not actually written right away, so a crash can lose the last 10 ms or so of writes. A sketch of both ideas follows.
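Here is a hedged sketch of write batching with a dallying interval, assuming a 4 KB user-level buffer and a ~10 ms interval (the buffer size, helper names, and the policy of checking the deadline only on the next write are illustrative choices; the data is not persistent until flush_buffer actually calls write):

#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BUFCAP 4096           /* illustrative batching buffer size */
#define DALLY_NS 10000000L    /* ~10 ms dallying interval */

static char buf[BUFCAP];
static size_t buflen;
static struct timespec first_unwritten;   /* when the oldest buffered byte arrived */

static void flush_buffer(int fd) {
    if (buflen > 0) {
        write(fd, buf, buflen);   /* only now does the data reach the OS/disk */
        buflen = 0;
    }
}

/* Buffer a write; push the batch out when the buffer fills or the dallying interval expires. */
void dallying_write(int fd, const void *data, size_t len) {
    if (len >= BUFCAP) {              /* too big to batch: flush, then write through */
        flush_buffer(fd);
        write(fd, data, len);
        return;
    }
    if (buflen + len > BUFCAP)        /* no room: flush what has been batched so far */
        flush_buffer(fd);
    if (buflen == 0)                  /* start of a new dallying interval */
        clock_gettime(CLOCK_MONOTONIC, &first_unwritten);
    memcpy(buf + buflen, data, len);
    buflen += len;

    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    long waited_ns = (now.tv_sec - first_unwritten.tv_sec) * 1000000000L
                   + (now.tv_nsec - first_unwritten.tv_nsec);
    if (waited_ns >= DALLY_NS)        /* dallied long enough: write the whole batch */
        flush_buffer(fd);
}

A real implementation would also arm a timer so the batch gets flushed even if no further write ever arrives; here the deadline is only checked when the next write comes in.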
Now assume there are lots of I/O requests to a disk, and we have to decide which request to serve next. A simple model: the disk is a collection of blocks 0 to n−1, each block is a disk sector, the current head position h is somewhere on the disk (0 ≤ h < n), and the cost of accessing block i is the distance |i − h|. We can serve requests first come, first served (FCFS), or shortest seek time first (SSTF), as in the sketch below.
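Here is a minimal sketch of that cost model (the request queue and starting head position are made-up numbers): the cost of serving block i from head position h is |i − h|, and we total that cost when serving requests in arrival order (FCFS) versus always serving the closest pending request (SSTF):

#include <stdio.h>
#include <stdlib.h>

/* Total head-movement cost of serving requests in arrival order (FCFS). */
static long fcfs_cost(const int *req, int m, int h) {
    long cost = 0;
    for (int i = 0; i < m; i++) {
        cost += abs(req[i] - h);
        h = req[i];
    }
    return cost;
}

/* Total head-movement cost of always serving the pending request closest to the head (SSTF). */
static long sstf_cost(const int *req, int m, int h) {
    int done[64] = {0};   /* assumes m <= 64 for this sketch */
    long cost = 0;
    for (int served = 0; served < m; served++) {
        int best = -1;
        for (int i = 0; i < m; i++)
            if (!done[i] && (best < 0 || abs(req[i] - h) < abs(req[best] - h)))
                best = i;
        done[best] = 1;
        cost += abs(req[best] - h);
        h = req[best];
    }
    return cost;
}

int main(void) {
    int req[] = {98, 183, 37, 122, 14, 124, 65, 67};   /* hypothetical pending block numbers */
    int m = sizeof req / sizeof req[0];
    int h = 53;                                        /* hypothetical current head position */
    printf("FCFS cost = %ld, SSTF cost = %ld\n",
           fcfs_cost(req, m, h), sstf_cost(req, m, h));
    return 0;
}

With these made-up numbers, SSTF does far less total head movement than FCFS, which is why it gives better average latency.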