Lecture 10 Scribe Notes 05-04-2010
by Allen Liang
File System Performance
Some Historical Background of Disk Performance
most commercial file systems live on disk
when Unix was designed in the 1970s, disk speed was faster than CPU speed
design attributes reflect that era; the old assumptions are becoming obsolete
nowadays CPU speed is much faster (it has grown exponentially, per Moore's Law)
(graph shown in lecture: speed vs. time for CPU and disk)
What's A Disk?
file system = set of blocks on disk that represent Unix file hierarchy
typical disk rotational speed is fixed at 7200 rpm (120 Hz)
to read/write disk, there are 2 data paths
disk <=> disk buffer <=> RAM <=> CPU (slow)
disk <=> disk buffer <=> CPU (faster)
Why Disks Are Slow
steps to read an arbitrary sector
SEEKING
accelerate heads towards where sector will be
decelerate heads, settle them over the correct track
WAITING
wait for sector to be underneath heads
READING
read data as it goes underneath heads
TRANSFERRING
transfer data to the main memory
Humdrum Drive Numbers
Seagate Barracuda ES.2 1 TB
speed + latency
16 MiB cache (disk buffer)
7200 rpm (120 Hz)
a full rotation takes 8.333 ms, so the average rotational latency is 4.166 ms (half a rotation)
8.5 ms average read seek
9.5 ms average write seek
writes must be positioned more precisely than reads
it takes extra time for the heads to decelerate and settle on the exact write location
0.8 ms track-to-track read seek
can start reading as soon as the head is near the correct track
1.0 ms track-to-track write seek
slower than the track-to-track read seek, since the write head must settle exactly on the correct location before writing
20 ms full stroke seek (time to seek the entire width of the disk, from innermost track to the outermost)
cache and transfer rate
1287 Mb/s max. internal transfer rate (from disk to internal buffer)
speed of transferring data from the platter to the cache
116 MB/s max. sustained transfer rate
speed of sequential disk reads
3 Gb/s external (SATA) transfer rate
communication speed between cache and bus
bytes/sector is configurable: 512, 520, 524, or 528
the extra bytes are used for error correction, since the raw bits off the platter cannot always be trusted
power consumption
typically 12.5 W when operating and 9.0 W when idle
Disk Failure and Prevention
Failure
The annual failure rate (AFR) (24x7) is 0.73%
Mean Time Between Failures (MTBF) is 1.2 million hours
10^-15 nonrecoverable sector read failure rate
roughly 1 unrecoverable read per 10^15 (a quadrillion) bits read
disk failure due to heat
for instance, a heated server room
one can address the heat problem by lowering the room temperature (around 50°F is considered optimal), but this is costly in power
or one can simply replace disks as they fail (disks are cheap nowadays)
Prevention
when a disk is about to be moved or lose power, it "parks" its heads within milliseconds: the heads are moved to a special landing track, designated for takeoffs and landings and holding no data, where they are least likely to damage the drive
error detection parity bit
error correction byte
Disk Performance - READING
Reading Example 1 (unbuffered)
    for (;;) {
        char c;
        if (read(fd, &c, 1) <= 0)
            break;
        process(c);   // 1.0 microsecond
    }
first time: ~8.33 ms rotational latency before the first byte arrives
later reads: by the time process(c) finishes, the next byte has already rotated past the head, so each 1-byte read waits another full rotation (~8.33 ms)
Conclusion
120 characters read per second (one character per 8.33 ms rotation)
need to speed up
Batching for Read (Assuming Sequential)
Applying Batching to Reading Example 1 (above)
read a whole sector and keep it for next time
subsequent reads of the same sector cost only processing time
511 of the 512 characters are served from the buffer: 511 × 0.001 ms = 0.511 ms
1 of 512 reads goes to disk: ~8.3 ms rotational latency
total time ≈ 8.8 ms per 512 characters
this gives about 60,000 chars/s read
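The batching scheme above can be sketched in C; the helper name getchar_batched and the one-slot buffer are illustrative assumptions, not an API from the lecture:

```c
#include <assert.h>
#include <unistd.h>

/* Sketch of batching: pay the ~8.8 ms disk access once per sector,
 * then serve the next 511 one-byte requests from memory.
 * getchar_batched is an illustrative name, not a lecture API. */
enum { SECTOR = 512 };
static char buf[SECTOR];
static ssize_t buflen = 0;   /* bytes currently in buf */
static ssize_t bufpos = 0;   /* next byte to hand out  */

int getchar_batched(int fd)
{
    if (bufpos == buflen) {              /* buffer empty: hit the disk */
        buflen = read(fd, buf, SECTOR);  /* the one expensive read     */
        bufpos = 0;
        if (buflen <= 0)
            return -1;                   /* EOF or error */
    }
    return (unsigned char) buf[bufpos++]; /* the 511 cheap reads */
}
```

This drop-in replaces the `read(fd, &c, 1)` loop from Reading Example 1: same byte-at-a-time interface, but only 1 in 512 calls touches the disk.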
Applying Batching to Seagate Barracuda Disk (earlier)
suppose 1-track batching, with 1 MB per track read at the 116 MB/s sustained transfer rate
to read and process 1 MB:
4.2 ms average rotational latency
8.4 ms transfer time (1 MB at ~116 MB/s)
0.8 ms track-to-track seek time
1000 ms process time (10^6 characters at 1 µs each)
total time ≈ 1013.4 ms
it gives roughly a 1 MB/s rate
More on Batching... (Pros and Cons)
prefetching = get data from the disk before application asks for it because you guess it's needed (our implementation assumes locality of reference)
is prefetching always a win?
yes, when the guess is usually right (e.g., sequential access)
in general, NO: if you guess wrong, the application waits a long time for data it does not need, and cache space is wasted
Suppose the disk changes after you prefetched
cache coherency
the cache must accurately reflect the state of the disk
this becomes a problem when multiple CPUs share RAM and the disk
one solution = require all I/O to go through cache
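A minimal sketch of prefetching with a one-sector read-ahead cache (the names and the one-slot design are illustrative assumptions; a real OS would issue the read-ahead asynchronously rather than block on it):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Prefetching sketch: when sector n is read, also pull sector n+1
 * into a one-slot cache, betting on locality of reference. */
enum { SECTOR = 512 };
static char cache[SECTOR];
static long cached = -1;    /* sector number held in cache, -1 = none */

int read_sector_ra(int fd, long n, char *out)
{
    if (n == cached) {
        memcpy(out, cache, SECTOR);   /* guess was right: no disk wait */
    } else if (pread(fd, out, SECTOR, (off_t) n * SECTOR) != SECTOR) {
        return -1;
    }
    /* speculate: grab the next sector while the heads are nearby */
    if (pread(fd, cache, SECTOR, (off_t) (n + 1) * SECTOR) == SECTOR)
        cached = n + 1;
    else
        cached = -1;                  /* guessed wrong (e.g., past EOF) */
    return 0;
}
```

Sequential readers hit the cache on every sector after the first; a random reader pays for a useless read-ahead each time, which is exactly the "guess wrong" cost described above.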
Disk Performance – WRITING
Dallying for Write
dallying
“write” data to cache only
the OS waits a bit before writing to disk
power failure?
assume no power failure
our implementation assumes locality of reference
two accesses close in time probably touch blocks that are close in space
speculation = OS guesses what application will do, does extra work now
prefetching is one example
WRITING Example 1
    write(10, buf, n);   // dallying for fast performance: the OS caches the data (bytes might still not be on disk)
    close(10);           // close() does not force the bytes to disk
Solutions to Prevent Failure from Dallying (New Syscalls)
new syscalls: sync();
schedule entire cache to be written
not very good: it only schedules the writes, with no guarantee of when they complete
new syscalls: fsync(fd);
all data + metadata (who owns the file? what is its size?) for this file are sent to disk
only returns after bits hit disk
downside: it is slow
it also requires a seek, since data and metadata are stored in two different areas of the disk
new syscalls: fdatasync(fd);
it saves the data only (skipping the metadata, and thus the extra seek)
WRITING Example 1 Modified
    write(10, buf, n);   // dallying for fast performance: cached, might still not be on disk
    fsync(10);           // force the bits to disk, preventing loss from dallying
    close(10);
reliability Vs. simplicity
emacs (SAVE)
“undo()” for reads and writes is possible but expensive and slow
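One common way editors implement a reliable SAVE (a sketch of the standard write-to-temp-then-rename pattern; the lecture itself only notes the reliability/simplicity trade-off):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Crash-safe save: write the new contents to a temp file, fsync so
 * the bits actually hit the disk, then rename over the original.
 * rename() replaces the file atomically, so a crash at any point
 * leaves either the complete old file or the complete new one. */
int save_file(const char *path, const char *tmp,
              const char *data, size_t len)
{
    int fd = open(tmp, O_CREAT | O_WRONLY | O_TRUNC, 0666);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t) len  /* dallying is fine here...  */
        || fsync(fd) != 0) {                   /* ...but sync before rename */
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    return rename(tmp, path);                  /* the atomic commit point */
}
```

The fsync() before rename() is the price of reliability: without it, a crash could commit the rename while the new file's bytes are still only in the cache.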
Multitasking
defined as 1 CPU runs multiple processes or threads by context switching
while a process waits for the disk, the disk is doing useful work but the CPU sits idle (wasted)
so 1 CPU runs multiple I/O-bound tasks concurrently to improve performance
    read_sector(...) {
        while ((inb(0x1f7) & 0xc0) != 0x40)
            schedule();   // changed from "continue;": yield the CPU instead of busy-waiting
        ...
    }
Performance Metrics for I/O
utilization = fraction of the system that's doing useful work (measured in percentage 0-100%)
throughput = rate of request completion (measured in I/O per second)
latency = delay from request to response (measured in ms)
example strategies and their measured costs

    Method            Latency (µs)   Throughput (KB/s)   CPU Utilization (%)
    polling           100            10                  5
    batching          1000           21                  10
    interrupt         100            18                  9
    DMA + interrupt   56             167                 45
    DMA + polling     50             180                 85

DMA + interrupt: the disk talks directly to RAM
DMA + polling: no need to save interrupt state, avoiding interrupt overhead