Computer Science 111, Spring 2010
Lecture 10, File System: Performance 5/4/10
By: David Chan, Zev Solomon.
- File system: A set of blocks on disk that represent the Unix file hierarchy
- Performance concerns will usually cause us to change our design
- Most commercial file system data resides on hard disks
- Disk speeds aren't increasing as fast as CPU speeds (disk improvement is not exponential)
- Unix was designed when disk speed was greater than CPU speed
- Those design assumptions are obsolete today because CPU speed > disk speed
- The need for a new file system is orthogonal to the need for a new kernel
Why Disks are Slow
- Built out of multiple platters rotating around a single spindle
- The problem is that the disks rotate around at 120Hz (7200rpm)
- Steps to do a read:
- Accelerate heads towards where the sector will be
- Decelerate heads, settle them over the correct track <-- Seek
- Wait for the sector to be under the heads
- The disk spins at constant speed: it takes a long time to spin up, and we never spin it down because spinning back up would take too long
- Read the data as it goes under the head
- Copies the data into the disk buffer which sits on the hard drive itself
- CPU will read the disk buffer into RAM
- Transfer data to main memory
- Most hard drive failures are hardware failures caused by excessive heat
- Humdrum Drive Numbers:
- Seagate Barracuda ES2 1TB
- 16 MiB cache
- 7200rpm = 120 Hz (8.333 ms to do a full rotation. Slowwwwwwwwwwwww)
- 8.5ms for avg read seek, 9.5 ms for avg write seek (4.2 ms rotational)
- 1.2 million hours MTBF (mean time b/t failure)
- 0.8ms Track to track read seek
- 1.0ms track to track write seek
- ? 20ms full stroke seek
- 1287 Mb/s maximum internal xfer rate
- 116 MB/s maximum sustained xfer rate
- 3 Gb/s external transfer rate
- Bytes/sector 512, 520, 524, 528 (configurable)
- 12.5W typical
- 9.0W idle --> most of the power goes into just spinning the disks
- 0.73% AFR(24x7) --> Annualized failure rate (disagrees with MTBF a lot)
- 10^-15 nonrecoverable read failure rate (one bad read per 10^15 bits read)
Suppose we want to run this code:
for (;;) {
    char c;
    if (read(fd, &c, 1) <= 0)
        break;
    process(c);
}
It takes this long per character:
    8.5 ms avg seek
    4.2 ms avg rotational latency
    0.004 ms transfer
That's 12.7 ms per character when every read pays a seek.
For sequential reads it's 8.4 ms avg (2x the avg rotational latency: no seek, but the disk must make almost a full rotation to bring the next sector back under the head).
At 8.4 ms/char we process only ~120 characters/sec (REALLY SLOWWWWWWWWWpoke)
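The arithmetic above can be checked with a quick back-of-the-envelope calculation. This is a sketch using the Barracuda figures from earlier; the function names are our own, nothing here is measured:

```c
#include <assert.h>

/* Barracuda ES.2 figures from above, in milliseconds. */
#define AVG_SEEK_MS 8.5
#define AVG_ROT_MS  4.2    /* half a rotation at 120 Hz */
#define XFER_MS     0.004  /* one 512-byte sector */

/* Cost of one 1-byte read when every read pays a seek. */
double ms_per_char_with_seek(void) {
    return AVG_SEEK_MS + AVG_ROT_MS + XFER_MS;   /* ~12.7 ms */
}

/* Sequential 1-byte reads: no seek, but we wait almost a full
 * rotation (2x the average rotational latency) for the next sector. */
double ms_per_char_sequential(void) {
    return 2 * AVG_ROT_MS;                       /* 8.4 ms */
}
```

Dividing 1000 ms by either result gives the characters/sec figures in the notes.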
How do we make this faster?
Batching
- Prefetching: get data from the disk before the application asks for it because we are guessing that it is going to be needed
- It is not always a win because if you guess wrong, the application will need to wait a long time for more data
- Suppose the disk changes after you fetch?
- Cache Coherency
- One solution: require all I/O to go through the cache
- Read whole sector, keep for next time
- 511 of 512 reads hit the cache: 511 * 0.001 ms ~= 0.5 ms
- 1 of 512 reads goes to disk: 8.3 ms (one rotation, no seek)
- Total: ~8.8 ms to read the sector's 512 chars
- ~0.017 ms/char (~60,000 chars/sec)
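The "read a whole sector, keep it for next time" idea can be sketched at user level; this is the same trick stdio's getc() uses, and the struct and function names here are our own:

```c
#include <assert.h>
#include <unistd.h>

/* Minimal sketch of sector-sized batching: read 512 bytes at a time
 * and serve single-character requests from the in-memory buffer, so
 * only 1 in 512 requests actually touches the disk. */
enum { SECTOR = 512 };

struct cbuf {
    int fd;
    char buf[SECTOR];
    ssize_t len;   /* bytes currently in buf */
    ssize_t pos;   /* next byte to hand out */
};

/* Return the next character, or -1 on EOF/error. */
int cbuf_getc(struct cbuf *b) {
    if (b->pos >= b->len) {
        b->len = read(b->fd, b->buf, SECTOR);  /* one disk access per sector */
        b->pos = 0;
        if (b->len <= 0)
            return -1;
    }
    return (unsigned char) b->buf[b->pos++];
}
```

The real kernel buffer cache does the same thing one level down, which is why all I/O must go through it to stay coherent.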
Bigger Batches
- How about we read all of the disk into RAM?
- We don't have that much RAM, do we…
- 116 MB/s sustained xfer rate
- 1 track batch = ~1 MB (116 MB/s * 8.33 ms per rotation)
- To process 1 MB:
- 8.4 ms to transfer (one full rotation)
- 4.2 ms avg rotational latency (plus a 0.8 ms track-to-track seek when we move to the next track)
- ~1000 ms of CPU time to process the 10^6 characters
- 8.4 + 4.2 + 1000 = 1012.6 ms ~= 1 MB/sec
- This is fast enough that we are now CPU bound
- We can now shrink our batch to save memory if we choose to
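The batch arithmetic as a sanity check; the ~1 us/char processing cost is an assumption chosen to match the ~1000 ms CPU figure in the notes:

```c
#include <assert.h>

/* Time to fetch and process one 1 MB (one-track) batch, in ms. */
double batch_ms(void) {
    double xfer = 8.4;           /* one full rotation to read the track */
    double rot  = 4.2;           /* average rotational latency */
    double cpu  = 1e6 * 0.001;   /* 10^6 chars at ~1 us each (assumed) */
    return xfer + rot + cpu;     /* 1012.6 ms: ~1 MB/s */
}
```

Disk time (12.6 ms) is now ~1% of the total, which is what "CPU bound" means here.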
Will this trick work for writing too?
- Prefetching works for reading, but we can't prefetch for writes
- Dallying
- The idea: update the cache but don't actually update the disk immediately
- The problem: the cache is still volatile, so on a power failure we lose the dallied writes (we'll touch on this when dealing with Robustness)
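Dallying can be sketched at user level as a write-behind buffer; this is an illustration of the idea, not the kernel's implementation, and the names are our own:

```c
#include <assert.h>
#include <unistd.h>

/* Sketch of dallying: writes accumulate in a volatile buffer and reach
 * the underlying fd only when the buffer fills or we explicitly flush.
 * A crash before the flush loses the buffered bytes -- that is the
 * robustness cost of dallying. */
enum { WBUF = 512 };

struct wcache {
    int fd;
    char buf[WBUF];
    size_t used;
};

/* Push the buffered bytes down to the fd.  Returns 0 on success. */
int wcache_flush(struct wcache *w) {
    if (w->used > 0 && write(w->fd, w->buf, w->used) != (ssize_t) w->used)
        return -1;
    w->used = 0;
    return 0;
}

/* Record one byte.  Only flushes (touches the "disk") when full. */
int wcache_putc(struct wcache *w, char c) {
    if (w->used == WBUF && wcache_flush(w) < 0)
        return -1;
    w->buf[w->used++] = c;  /* cached only; disk not yet updated */
    return 0;
}
```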
Our implementation of prefetching made an assumption:
Locality of Reference
- Two accesses that are nearby in time probably access blocks that are nearby in space
Speculation
- The operating system guesses what the application will do in the future
- Does extra work first to improve performance
Emacs Problem:
- When you do a write(), the bytes might still not be on disk
- Once you close, they might just be cached
- To get around this, new system calls:
sync(); //schedule entire cache to be written
fsync(fd); //all of data and metadata for file sent to disk (The problem is that it is slow)
fdatasync(fd); //Only the data
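A typical pattern that solves the Emacs problem: write, then fsync() before reporting success, so the data and metadata are on disk before we tell the user the save worked. The helper name and path below are illustrative:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write a buffer and force it (data + metadata) to disk before
 * returning, so a crash after this call cannot lose the contents. */
int durable_write(const char *path, const char *buf, size_t n) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, n) != (ssize_t) n || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Swapping fsync() for fdatasync() here would skip the metadata (e.g. timestamps) and is often faster when only the file contents matter.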
Multitasking
Performance Metrics for I/O
- Utilization: fraction of system that's doing useful work %
- Throughput: rate of request completion (I/Os / s)
- Latency: delay from request to response (ms)
Example strategies and their metrics:

Method           | Latency | Throughput | Utilization
-----------------+---------+------------+------------
Polling          | 100 μs  | 10 KB/s    | 5%
Batching         | 1000 μs | 21 KB/s    | 10%
Interrupts       | 100 μs  | 18 KB/s    | 9%
DMA + interrupts | 56 μs   | 167 KB/s   | 45%
DMA + polling    | 50 μs   | 180 KB/s   | 85%