File Systems
What We Value in a File System
- Performance
- Organization
- Fast to find stuff (if you know where it is)
- Easy to change things
- Space, lots of it
- Security
- Reliability
History of Disk Performance
CPU performance has been growing exponentially since the 1960s, but hard disk spindle speeds have only increased from 3600 rpm in the 1960s to 15,000 rpm today. The file systems currently in use were designed in the 1980s, when CPU and hard disk speeds were much closer together. New designs should anticipate how these parameters will change over the next 1-2 decades.
Hard Disk Structure
The data that is easiest to read together lies in a single cylinder: the set of tracks at the same distance from the spindle on every platter, all reachable without moving the arm.
To read a different cylinder:
- Accelerate the arms in the right direction
- Start decelerating the arms about halfway, so that they stop at the right track
- Wait for the read heads to settle on the track
- Wait for data to rotate under head
Example Hard Drive: Seagate Barracuda LP 2
- 300 MB/s external transfer rate - disk buffer to bus
- 90 MB/s sustained data rate - disk read speed, limited mostly by how fast the disk spins
- 32 MB cache - buffer to hold r/w data
- 5.5 ms avg latency - time for half a rotation of the disk
- 10 ms average seek time - one sector of data to another at random (NB: this number is made up)
- 5900 rpm spindle speed = 98.33 Hz (one rotation takes about 10.2 ms, so half a rotation roughly matches the average latency above)
- 50,000 contact starts/stops - number of times you can power hard drive on/off
- 10^-14 nonrecoverable read errors per bit read
- 0.32% annualized failure rate (for a constantly running disk)
- 6.8 W average operation (performing operations)
- 5.5 W idle without I/O
- 70 G operating shock (jolt tolerance while running, e.g., earthquakes)
- 300 G nonoperating shock (jolt tolerance while powered off)
- 2.5 bels acoustic (idle) - noise primarily from spindle
- 2.6 bels acoustic (operating)
At 1 ns per instruction, the CPU can execute roughly 10,000,000 instructions in the ~10 ms it takes to seek to another sector at random. This is slow. A faster option is flash memory, i.e., a solid-state drive (SSD).
Example SSD: Corsair Force GT
- 3 Gb/s external transfer
- 280 MB/s sequential read
- 270 MB/s sequential write
- 1,000,000 hours mean time between failures (not counting flash wear from writes)
- 1000G shock
- 20W max power consumption
- 50,000 I/O operations per second (IOPS)
While SSDs perform better, at $90 for 60 GB versus $150 for 2 TB of hard disk, they are far more expensive per byte and thus cannot yet replace hard drives.
File System Performance
Strategies to Improve Performance
We can use the following methods to increase the performance of a file system without altering the hardware.
Speculation (read):
The file system guesses what the application will do and does that work ahead of time.
- prefetching: guessing where the application will read next and fetching that data ahead of time. This keeps the disk busy doing useful work.
- batching: the application asks for 512 bytes, but the file system reads 8192 bytes at a time. If the application then asks for the next bytes sequentially, the disk latency for those calls has already been paid. (A sketch of batching follows this list.)
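As a rough illustration, here is a minimal user-level sketch of batching. The read_batched helper, the 8192-byte buffer, and the use of the POSIX read call are illustrative assumptions, not part of the original notes:

#include <unistd.h>

#define BATCH_SIZE 8192                  /* read 8192 bytes at a time, as above */

static char buf[BATCH_SIZE];
static size_t buf_len = 0;               /* bytes currently in the buffer */
static size_t buf_pos = 0;               /* next unread byte */

/* Hypothetical helper: hand the caller one byte, refilling the buffer
   with a whole batch from the disk only when it runs dry. */
int read_batched(int fd, char *c)
{
    if (buf_pos == buf_len) {
        ssize_t n = read(fd, buf, BATCH_SIZE);   /* one big read replaces many small ones */
        if (n <= 0)
            return -1;
        buf_len = n;
        buf_pos = 0;
    }
    *c = buf[buf_pos++];
    return 0;
}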
Dallying (write):
The application tells the file system what to do, but the file system waits a while before doing it, with the idea of saving work overall by grouping related tasks together. Dallying can also be applied to reads. This kind of batching is especially useful when the file system serves multiple applications, since grouping their requests lets the disk arm move less. (A sketch of dallying follows.)
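A minimal sketch of dallying on the write side, under the same caveats: the write_dallied and flush_writes helpers and the buffer size are hypothetical, for illustration only.

#include <string.h>
#include <unistd.h>

#define WBUF_SIZE 8192

static char wbuf[WBUF_SIZE];
static size_t wbuf_len = 0;

/* Issue all the writes we have been sitting on as one large request. */
int flush_writes(int fd)
{
    if (wbuf_len > 0 && write(fd, wbuf, wbuf_len) != (ssize_t)wbuf_len)
        return -1;
    wbuf_len = 0;
    return 0;
}

/* Hypothetical helper: delay each small write, grouping related work. */
int write_dallied(int fd, const char *data, size_t n)
{
    if (n >= WBUF_SIZE) {                /* too big to dally: flush, then write through */
        if (flush_writes(fd) < 0 || write(fd, data, n) != (ssize_t)n)
            return -1;
        return 0;
    }
    if (wbuf_len + n > WBUF_SIZE && flush_writes(fd) < 0)
        return -1;
    memcpy(wbuf + wbuf_len, data, n);    /* dally: just remember the bytes for now */
    wbuf_len += n;
    return 0;
}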
Locality of Reference:
File access is not random in practice. Exploiting various forms of localities allows you to manipulate when you execute reads and writes for maximum efficiency. Forms of locality of reference include:
- spatial locality: accessed data tends to be clustered in the same area (illustrated in the sketch after this list)
- temporal locality: if a section has recently been referenced, it is likely that it will be referenced again in the near future.
- branch locality: useful for situations such as loops, where the possible outcomes are restricted to a few locations.
- equidistant locality: halfway between spatial and branch locality; if the spatial/temporal distance between accesses is regular, the next location can be predicted.
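Spatial locality is easiest to see in memory, though disk blocks behave analogously (sequential block reads are far cheaper than scattered ones). A hedged illustration, not from the original notes: summing a C array in row-major order (the order C lays it out) touches adjacent bytes, while column-major order jumps around.

#define N 1024

long sum_row_major(int a[N][N])
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];   /* consecutive accesses touch adjacent memory: good spatial locality */
    return sum;
}

long sum_col_major(int a[N][N])
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];   /* each access jumps N ints ahead: poor spatial locality */
    return sum;
}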
Performance Metrics
- latency: delay between request & response
- throughput: how many requests/second can the file system handle
- utilization (of CPU): percentage of CPU devoted to useful work
- utilization (of disk): percentage of disk devoted to useful data
Though you generally want utilization to be high, you may prefer high throughput to low latency. From the perspective of a systems administrator, who wants to maximize productivity for all users, higher throughput means more work done across all users, even if each individual request waits longer.
File System Example: Imaginary Machine
- 1 ns cycle time (1 instruction/cycle)
- 1 PIO = 1000 cycles = 1 µs
- 5 PIOs to send a command to disk = 5 µs
- 50 µs disk latency
- computation processes data at 125 cycles/byte = 0.125 µs/B
- 1 PIO/byte to read
We will use the above machine for our file system examples because of its simple numbers. The following is a slow, basic program that reads and processes a file one byte at a time. We can then apply various techniques to improve its speed and utilization.
Initial Method
for (;;) {
    char c;
    if (sys_read(&c) < 0)   // read a single byte from the disk; negative means no more data
        break;
    process(c);
}
send command | latency | inb | compute | total
5 µs | 50 µs | 1 µs | 0.125 µs | 56.125 µs
- latency: 56.125 µs per byte
- throughput: 1/56.125 µs ≈ 17,817 requests/sec
- utilization: 0.125/56.125 ≈ 0.2%
Method 0.5: Batch reading
In the first method, the command and disk-latency costs are paid for every byte. If we batch reads in 40-byte sectors, we can spread those costs over the whole block:
for (;;) {
    char buf[40];
    if (sys_read(buf, 40) < 0)   // read a whole 40-byte sector at once
        break;
    process(buf, 40);
}
send command | latency | inb | compute | total (per byte)
5 µs/40 | 50 µs/40 | 40 µs/40 | 5 µs/40 | 2.5 µs
- latency: 100 µs per 40-byte block (5 + 50 + 40 + 5)
- throughput: at 2.5 µs/B, 400,000 bytes/sec
- utilization: 5/100 = 5%
Method 1: Overlap Input & Computation
If we use device interrupts, we can overlap the I/O with the computation in our batched process:
do {
    block until there is an interrupt;
    handle the interrupt;   // issue PIOs to start the next read request
    compute;                // process the block we already have while the next read is in flight
} while (more data);
send command | latency (50 µs, in parallel with 5 µs compute) | inb | total (per byte)
5 µs/40 | 50 µs/40 | 40 µs/40 | 2.375 µs
Though we have overlapped I/O and computation, there is no significant improvement with a 40 B block, because the computation (5 µs) is much shorter than the I/O it overlaps (50 µs). To maximize the parallelism, the block size should grow until computation time equals the disk latency: 50 µs ÷ 0.125 µs/B = 400 bytes.
send command | latency (50 µs, in parallel with 50 µs compute) | inb | total (per byte)
5 µs/400 | 50 µs/400 | 400 µs/400 | ≈1.14 µs
At this point, our processing cost becomes 455/400 ≈ 1.14 µs/B, a significant improvement. This is the best this approach can do, because the I/O and computation times are now equal. The fastest we could possibly go with the above methods is 1 µs/B, the cost of the one PIO needed per byte; the PIO speed becomes the bottleneck. (A user-level sketch of this overlap follows.)
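Outside the kernel, the same overlap can be approximated with double buffering and asynchronous I/O. Below is a minimal sketch using POSIX AIO; the 400-byte block size, the process function, and the file-descriptor plumbing are assumptions for illustration, and a real driver would use device interrupts instead:

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define BLOCK 400   /* the optimized block size from the table above */

void process(char *buf, ssize_t n);   /* assumed to exist, as in the earlier examples */

void read_overlapped(int fd)
{
    static char buf[2][BLOCK];
    struct aiocb cb;
    off_t offset = 0;
    int cur = 0;

    memset(&cb, 0, sizeof cb);               /* start the first read */
    cb.aio_fildes = fd;
    cb.aio_buf = buf[cur];
    cb.aio_nbytes = BLOCK;
    cb.aio_offset = offset;
    aio_read(&cb);

    for (;;) {
        const struct aiocb *const list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)   /* wait for the outstanding read */
            aio_suspend(list, 1, NULL);
        ssize_t n = aio_return(&cb);
        if (n <= 0)                          /* end of data or error: stop */
            break;

        int prev = cur;                      /* start the next read into the other buffer */
        cur = !cur;
        offset += n;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf = buf[cur];
        cb.aio_nbytes = BLOCK;
        cb.aio_offset = offset;
        aio_read(&cb);

        process(buf[prev], n);               /* compute while that read is in flight */
    }
}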
Method 2: Direct Memory Access
We can address the PIO bottleneck by using DMA, where the disk controller copies data into memory itself instead of the CPU issuing one PIO per byte. Using the optimized 400 B block, we get:
send command | latency (50 µs, in parallel with 50 µs compute) | inb | total (per byte)
5 µs/400 | 50 µs/400 | 0 µs/400 | ≈0.14 µs
This method removes the expensive per-byte PIO from our process. However, it still pays the overhead of handling an interrupt for every block, which can become the next bottleneck. To remove that, we apply:
Method 3: DMA + polling
This method uses DMA but polls the disk controller after the computation has finished, rather than taking an interrupt. With a block size chosen so that the computation and the I/O finish at about the same time, the busy-wait costs the CPU very little, and avoiding the interrupt-handler overhead increases throughput. Disk utilization rises to 50/55 ≈ 91% (50 µs of data transfer out of the 55 µs each block now takes). A sketch of this structure follows.
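In C-flavored pseudocode, method 3 might look like the sketch below; dma_start, dma_done, more_data, and the 400-byte blocks are hypothetical stand-ins for a real disk controller interface:

#include <stddef.h>

void dma_start(char *buf, size_t n);   /* assumed: tell the controller to DMA the next block */
int  dma_done(void);                   /* assumed: poll the controller's status register */
int  more_data(void);                  /* assumed: is there another block to read? */
void process(char *buf, size_t n);     /* as in the earlier examples */

void read_dma_polling(void)
{
    char buf[2][400];
    int cur = 0;

    dma_start(buf[cur], 400);          /* start the first transfer */
    while (more_data()) {
        int prev = cur;
        cur = !cur;

        while (!dma_done())
            continue;                  /* busy-wait: no interrupt-handler overhead */

        dma_start(buf[cur], 400);      /* immediately start the next transfer... */
        process(buf[prev], 400);       /* ...and compute while it is in flight */
    }
}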
Method 4: Multitasking
In methods 1-3, we assumed the machine had one CPU and one I/O device devoted to a single program. Those methods work well for single-purpose applications, but in practice we are more likely to run many applications that we cannot control individually. Instead, we use multitasking.
// input is done via this code snippet
while ((inb(0x1f7) & 0xc0) != 0x40) {   // while the disk is not ready
    yield();   // invoke the scheduler: give the CPU to a thread that can use it
}
For a system with multiple processes, multitasking is the best way to maximize utilization: if one process cannot do anything useful with the CPU, it hands the CPU off to another process that can. This approach is so effective that the specialized single-program optimizations above become largely unnecessary.