Lecture 10 Scribe Notes 05-04-2010
by Allen Liang
File System Performance
Some Historical Background of Disk Performance
most commercial file systems live on disk
when Unix was designed in the 1970s, disk speed was faster than CPU speed
design attributes reflect that era; the old assumptions are becoming obsolete
nowadays CPU speed is much faster (it has grown exponentially, per Moore's Law)
(graph shown in lecture: speed vs. time for CPU and disk)
What's A Disk?
file system = set of blocks on disk that represent Unix file hierarchy
typical disk rotational speed is fixed at 7200 rpm (120 Hz)
to read/write disk, there are 2 data paths
disk <=> disk buffer <=> RAM <=> CPU (slow)
disk <=> disk buffer <=> CPU (faster)
Why Disks Are Slow
steps to read an arbitrary sector
SEEKING
accelerate heads towards where sector will be
decelerate heads, settle them over the correct track
WAITING
wait for sector to be underneath heads
READING
read data as it goes underneath heads
TRANSFERRING
transfer data to the main memory
Humdrum Drive Numbers
Seagate Barracuda ES.2 1 TB
speed + latency
16 MiB cache (disk buffer)
7200 rpm (120 Hz)
a full rotation takes 8.333 ms, so the average rotational latency is 4.166 ms (half a rotation)
8.5 ms average read seek
9.5 ms average write seek
writes must be positioned more precisely than reads
it takes extra time for the heads to decelerate and settle on the exact write location
0.8 ms track-to-track read seek
can start reading as soon as the head is near the correct track
1.0 ms track-to-track write seek
slower than the track-to-track read seek, since the write head must settle exactly on the correct location before writing
20 ms full stroke seek (time to seek the entire width of the disk, from innermost track to the outermost)
cache and transfer rate
1287 Mb/s max. internal transfer rate (from disk to internal buffer)
speed of transferring data from the platter to the cache
116 MB/s max. sustained transfer rate
speed of sequential disk reads
3 Gb/s external (SATA) transfer rate
communication speed between cache and bus
bytes/sector is configurable: 512, 520, 524, or 528
the extra bytes are used for error correction, since the raw bits off the platter cannot always be trusted
power consumption
typically 12.5 W when operating and 9.0 W when idle
Disk Failure and Prevention
Failure
The annual failure rate (AFR) (24x7) is 0.73%
Mean Time Between Failures (MTBF) is 1.2 million hours
10^-15 nonrecoverable sector read failure rate
roughly 1 unrecoverable read per 10^15 (a quadrillion) bits read
disk failure due to heat
for instance, a heated server room
one can address the heat problem by lowering the room temperature (around 50°F is considered optimal), but this is costly in power
or one can simply replace disks as they fail (disks are cheap nowadays)
Prevention
when a disk is about to be moved or lose power, it "parks" its heads within milliseconds: the heads are moved to a special landing track, designated for takeoffs and landings and holding no data, where they are least likely to damage the drive
error detection parity bit
error correction byte
Disk Performance - READING
Reading Example 1 (unbuffered)
    for (;;) {
        char c;
        if (read(fd, &c, 1) <= 0)
            break;
        process(c);   // 1.0 microsecond
    }
first time: ~8.33 ms rotational latency before the first byte arrives
later reads: by the time process(c) finishes, the next byte has already rotated past the head, so each 1-byte read waits another full rotation (~8.33 ms)
Conclusion
120 characters read per second (one character per 8.33 ms rotation)
need to speed up
Batching for Read (Assuming Sequential)
Applying Batching to Reading Example 1 (above)
read a whole sector and keep it for next time
subsequent reads of the same sector cost only processing time
511 of the 512 characters are served from the buffer: 511 × 0.001 ms = 0.511 ms
1 of 512 reads goes to disk: ~8.3 ms rotational latency
total time ≈ 8.8 ms per 512 characters
this gives about 60,000 chars/s read
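The batching scheme above can be sketched in C; the helper name getchar_batched and the one-slot buffer are illustrative assumptions, not an API from the lecture:

```c
#include <assert.h>
#include <unistd.h>

/* Sketch of batching: pay the ~8.8 ms disk access once per sector,
 * then serve the next 511 one-byte requests from memory.
 * getchar_batched is an illustrative name, not a lecture API. */
enum { SECTOR = 512 };
static char buf[SECTOR];
static ssize_t buflen = 0;   /* bytes currently in buf */
static ssize_t bufpos = 0;   /* next byte to hand out  */

int getchar_batched(int fd)
{
    if (bufpos == buflen) {              /* buffer empty: hit the disk */
        buflen = read(fd, buf, SECTOR);  /* the one expensive read     */
        bufpos = 0;
        if (buflen <= 0)
            return -1;                   /* EOF or error */
    }
    return (unsigned char) buf[bufpos++]; /* the 511 cheap reads */
}
```

This drop-in replaces the `read(fd, &c, 1)` loop from Reading Example 1: same byte-at-a-time interface, but only 1 in 512 calls touches the disk.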
Applying Batching to Seagate Barracuda Disk (earlier)
suppose 1-track batching, with 1 MB per track read at the 116 MB/s sustained transfer rate
to read and process 1 MB:
4.2 ms average rotational latency
8.4 ms transfer time (1 MB at ~116 MB/s)
0.8 ms track-to-track seek time
1000 ms process time (10^6 characters at 1 µs each)
total time ≈ 1013.4 ms
it gives roughly a 1 MB/s rate
More on Batching... (Pros and Cons)
prefetching = get data from the disk before application asks for it because you guess it's needed (our implementation assumes locality of reference)
is prefetching always a win?
yes, when the guess is usually right (e.g., sequential access)
in general, NO: if you guess wrong, the application waits a long time for data it does not need, and cache space is wasted
Suppose the disk changes after you prefetched
cache coherency
the cache must accurately reflect the state of the disk
this becomes a problem when multiple CPUs share RAM and the disk
one solution = require all I/O to go through cache
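A minimal sketch of prefetching with a one-sector read-ahead cache (the names and the one-slot design are illustrative assumptions; a real OS would issue the read-ahead asynchronously rather than block on it):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Prefetching sketch: when sector n is read, also pull sector n+1
 * into a one-slot cache, betting on locality of reference. */
enum { SECTOR = 512 };
static char cache[SECTOR];
static long cached = -1;    /* sector number held in cache, -1 = none */

int read_sector_ra(int fd, long n, char *out)
{
    if (n == cached) {
        memcpy(out, cache, SECTOR);   /* guess was right: no disk wait */
    } else if (pread(fd, out, SECTOR, (off_t) n * SECTOR) != SECTOR) {
        return -1;
    }
    /* speculate: grab the next sector while the heads are nearby */
    if (pread(fd, cache, SECTOR, (off_t) (n + 1) * SECTOR) == SECTOR)
        cached = n + 1;
    else
        cached = -1;                  /* guessed wrong (e.g., past EOF) */
    return 0;
}
```

Sequential readers hit the cache on every sector after the first; a random reader pays for a useless read-ahead each time, which is exactly the "guess wrong" cost described above.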
Disk Performance – WRITING
Dallying for Write
dallying
“write” data to cache only
the OS waits a bit before writing to disk
power failure?
assume no power failure
our implementation assumes locality of reference
two accesses close in time probably touch blocks that are close in space
speculation = OS guesses what application will do, does extra work now
prefetching is one example
WRITING Example 1
    write(10, buf, n);   // dallying for fast performance: the OS caches the data (bytes might still not be on disk)
    close(10);           // close() does not force the bytes to disk
Solutions to Prevent Failure from Dallying (New Syscalls)
new syscalls: sync();
schedule entire cache to be written
not very good: it only schedules the writes, with no guarantee of when they complete
new syscalls: fsync(fd);
all data + metadata (who owns the file? what is its size?) for this file are sent to disk
only returns after bits hit disk
downside: it is slow
it also requires a seek, since data and metadata are stored in two different areas of the disk
new syscalls: fdatasync(fd);
it saves the data only (skipping the metadata, and thus the extra seek)
WRITING Example 1 Modified
    write(10, buf, n);   // dallying for fast performance: cached, might still not be on disk
    fsync(10);           // force the bits to disk, preventing loss from dallying
    close(10);
reliability Vs. simplicity
emacs (SAVE)
“undo()” for reads and writes is possible but expensive and slow
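One common way editors implement a reliable SAVE (a sketch of the standard write-to-temp-then-rename pattern; the lecture itself only notes the reliability/simplicity trade-off):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Crash-safe save: write the new contents to a temp file, fsync so
 * the bits actually hit the disk, then rename over the original.
 * rename() replaces the file atomically, so a crash at any point
 * leaves either the complete old file or the complete new one. */
int save_file(const char *path, const char *tmp,
              const char *data, size_t len)
{
    int fd = open(tmp, O_CREAT | O_WRONLY | O_TRUNC, 0666);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t) len  /* dallying is fine here...  */
        || fsync(fd) != 0) {                   /* ...but sync before rename */
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    return rename(tmp, path);                  /* the atomic commit point */
}
```

The fsync() before rename() is the price of reliability: without it, a crash could commit the rename while the new file's bytes are still only in the cache.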
Multitasking
defined as 1 CPU runs multiple processes or threads by context switching
while a process waits for the disk, the disk is doing useful work but the CPU sits idle (wasted)
so 1 CPU runs multiple I/O-bound tasks concurrently to improve performance
    read_sector(...) {
        while ((inb(0x1f7) & 0xc0) != 0x40)
            schedule();   // changed from "continue;": yield the CPU instead of busy-waiting
        ...
    }
Performance Metrics for I/O
utilization = fraction of the system that's doing useful work (measured in percentage 0-100%)
throughput = rate of request completion (measured in I/O per second)
latency = delay from request to response (measured in ms)
example strategies and their measured costs

    Method            Latency (µs)   Throughput (KB/s)   CPU Utilization (%)
    polling           100            10                  5
    batching          1000           21                  10
    interrupt         100            18                  9
    DMA + interrupt   56             167                 45
    DMA + polling     50             180                 85

DMA + interrupt: the disk talks directly to RAM
DMA + polling: no need to save interrupt state, avoiding interrupt overhead