CS 111

File System Performance

Scribe Notes for Lecture 10 on May 07, 2013

by Esther Tse, Moshe Kahn, Wai Lok Kelvin Wong, Jinhyeuk Baek


A file system is a data structure used to store, retrieve, and update a set of files. It can live in memory or in secondary storage, so we need to design this data structure with an eye to speed: we want good performance from our file systems.
Metrics for Performance
Latency*: response time; the delay from a particular request to its response.
Throughput*: rate of request completion (I/Os per second).
Transfer rate (lower level): the rate at which data moves from one place to another.
Capacity: how much data the file system can hold. With compression you can store more than the physical device's raw capacity.
Fragmentation (lower level): how much of the data on the actual storage device is usable/visible to you.
Utilization (closely related to throughput): the fraction of time the I/O system is actually doing I/O; the goal is to get as close to 100% as possible.
*Latency and throughput are the two metrics to think about most.

Performance Improvements
Let's look at ways we can read data in from a device. We will discuss and analyze performance for five approaches:
The Naive Way, Batching, Interrupts, DMA (direct memory access) with Interrupts, and DMA with Polling.

To illustrate the difference in performances, let's assume we have a device with the following characteristics:
CPU cycle time: 1 ns
Programmed I/O (e.g., inb, outb): 1 µs (1000 cycles)
Send command to device: 5 µs (5 programmed I/Os)
Device latency: 50 µs (time between sending a command and the device executing it; 10x the time it takes to set the command up)
Computation per buffer: 5 µs (5000 cycles)
Read 40 bytes back from device: 40 µs (40 programmed I/Os)
The Naive Way
This approach reads in 40 bytes, computes on that data, and repeats indefinitely.

for (;;) {
	char buf[40];
	read 40 bytes in buf;
	compute (buf);
}
Latency = send command + device latency + read + compute
= 5µs + 50µs + 40µs + 5µs
= 100µs

Throughput = (1/latency) = 10,000 40-byte-req/s

CPU utilization (amount of time CPU spent computing over total time) = 5/100 = 5%
Batching
Instead of doing little work over and over, let's increase the buffer size and do it in larger batches each time. How big should we make the batch? We chose 21 arbitrarily to make performance calculations nice, but there are drawbacks to making the batch size very large. It requires a lot of memory and the entire buffer must be filled before computations can be performed.

for (;;) {
	enum { BATCHSIZE = 21 };
	char buf[40 * BATCHSIZE];
	read 40 * BATCHSIZE bytes into buffer;
	for (int i = 0; i < BATCHSIZE; i++) {
		compute(buf + 40*i);
	}
}
Latency = send command + device latency + read + compute
= 5µs + 50µs + (40*21)µs + (5*21)µs
= 1,000µs
(This is the latency for the entire 21-buffer batch, not for a single 40-byte request.)

Throughput = 21 requests per 1,000 µs = 21,000 40-byte-req/s

CPU Utilization = 105/1,000 = 10.5%

Upside: CPU utilization goes up.
Downsides: Latency for very first operation goes up.
Memory overhead.
*Nowadays we are willing to trade RAM for better throughput / latency / CPU utilization.
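To see how these numbers scale with batch size, here is a minimal C sketch (not from the lecture) that plugs different batch sizes into the cost model above: 5 µs to send a command, 50 µs device latency, 40 µs to read a 40-byte buffer, and 5 µs of computation per buffer.

#include <stdio.h>

/* Cost model from the lecture, in microseconds. */
#define SEND_CMD   5.0   /* send command to device          */
#define DEV_LAT   50.0   /* device latency                  */
#define READ_BUF  40.0   /* read one 40-byte buffer back    */
#define COMPUTE    5.0   /* computation per 40-byte buffer  */

int main(void)
{
	int sizes[] = { 1, 2, 4, 8, 16, 21, 32, 64 };
	for (int i = 0; i < 8; i++) {
		int b = sizes[i];
		double latency = SEND_CMD + DEV_LAT + b * (READ_BUF + COMPUTE); /* whole batch */
		double throughput = b / latency * 1e6;             /* 40-byte requests per second */
		double utilization = b * COMPUTE / latency * 100;  /* % of time spent computing   */
		printf("batch=%2d  latency=%7.1f us  throughput=%8.0f req/s  util=%5.1f%%\n",
		       b, latency, throughput, utilization);
	}
	return 0;
}

Plugging in a batch size of 21 reproduces the 1,000 µs, 21,000 req/s, and 10.5% figures above; larger batches push utilization higher at the cost of latency and memory.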
Interrupt
An interrupt is a signal delivered to the CPU that suspends the code currently running. When a process is interrupted, the kernel takes back the CPU and decides which process should run next. Interrupts take up CPU time, but since we cannot assume all processes are well-behaved or bug-free, they are sometimes necessary to keep one process from making the others wait for a long time.

Let's say we have the following code:

for (;;) {
	write command to device;
	while (!ready) continue; 
	read data from device;
	compute;
}
Bottleneck: The while loop does no useful work; the CPU is just sitting and waiting for the data to be ready from the device. Let's get useful computation done instead of just "twiddling our thumbs".

Solution: Instead of waiting for the I/O to finish, we can overlap I/O with computation. We can't compute the current read because it hasn't finished, but we can compute the previous read. We can do this using signals, which can interrupt the normal sequence of processing. The processor will check for interrupts, and if there is a signal, it will suspend execution of the current code, save the current state, execute the code in the interrupt handler routine, and restore the context before continuing the interrupted code.

[Figure: interrupt-driven I/O]

We can change the previous code to the following for better performance:

for (;;) {
    write command to device;
    do { 
        wait for interrupt; // CPU stalls
        handle interrupt; // assume 5µs
    } while (!ready);
    read data from device;
    compute;
}

Did this change actually help?
Latency = send command + wait for interrupt block + handle interrupt + check ready + read + compute
= 5µs + 50µs + 5µs + 1µs + 40µs + 5µs
= 106µs

Throughput = 1/(56µs) ≈ 17,857 req/s
(56µs = 5 send + 5 handle interrupt + 1 check + 40 read + 5 compute; the 50µs device latency is overlapped with other work)

CPU utilization = 5/56 = 8.9%
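As a rough user-space analogy to the interrupt-driven loop above (assumed, not from the lecture), the sketch below uses a POSIX timer signal to stand in for the device's completion interrupt: the main loop "issues a command", sleeps until the handler sets a flag, then "reads" the data and computes. The helper names and the 50 µs timer are made up for illustration.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t ready = 0;

static void on_interrupt(int sig)        /* "handle interrupt" */
{
	(void)sig;
	ready = 1;
}

static void start_fake_device(void)      /* "write command to device" */
{
	struct itimerval t = { {0, 0}, {0, 50} };   /* one-shot timer, ~50 us */
	ready = 0;
	setitimer(ITIMER_REAL, &t, NULL);
}

int main(void)
{
	struct sigaction sa;
	sigset_t block, old;

	memset(&sa, 0, sizeof sa);
	sa.sa_handler = on_interrupt;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGALRM, &sa, NULL);

	sigemptyset(&block);
	sigaddset(&block, SIGALRM);

	for (int i = 0; i < 5; i++) {
		sigprocmask(SIG_BLOCK, &block, &old);   /* avoid missing the "interrupt" */
		start_fake_device();
		while (!ready)
			sigsuspend(&old);       /* CPU could run other work here instead */
		sigprocmask(SIG_SETMASK, &old, NULL);

		char buf[40];                   /* "read data from device" (faked) */
		for (int j = 0; j < 40; j++)
			buf[j] = (char)j;
		printf("request %d complete (buf[0] = %d)\n", i, buf[0]);  /* "compute" */
	}
	return 0;
}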
DMA (Direct Memory Access)
[Figure: DMA in the system architecture]

This diagram shows the relationship between the CPU, devices, memory, and the DMA controller.
DMA works in the background without CPU intervention, which speeds up data transfer and frees the CPU for other work.
It is used for moving large blocks of data, since doing that with programmed I/O would take too much CPU capacity.

DMA + Interrupt
A device controller transfers data between an I/O device and memory directly, without involving the CPU during the transfer. This is most useful when the CPU has other useful work to do while waiting for a long data transfer to finish. Once the transfer is complete, the device raises an interrupt so execution can resume.

for (;;) {
	issue a write command to device;
	block with interrupt;
	handle interrupt;
	check that device is ready;
	compute;
}
Latency = block + handler + check + compute
= 50µs + 5µs + 1µs + 5µs
= 61µs


Throughput = 1/(handler + check + compute) = (1/11µs) = 91,000 req/s

CPU Utilization = 5/11 = 45%

We can't optimize the computation, but can we optimize the handler and the check? The handler takes about as much time as the computation, so let's get rid of the handler. Optimizing the check isn't really worthwhile, since it takes relatively little time.

Polling
Polling is often contrasted with interrupts. With polling, the CPU periodically checks each device to see if it needs service. This can be efficient if a device is ready each time the CPU checks. Compared to interrupts, however, it burns CPU time when no requests are pending. Also, the CPU operates much faster than most I/O devices, so the CPU can end up busy-waiting, checking a slow device many times before it is ready.

An analogy for polling is picking up your cell phone every few seconds to see if you have a call. In comparison, interrupts are asynchronous, so it's like waiting for the phone to ring.
DMA + Polling

for (;;) {
	issue a write command to device;
	while (DMA slots not ready)
	       schedule();           // give up CPU if not done, let something else run
	check that device is ready;
	compute;
}
Latency = block + check + compute
= 50µs + 1µs + 5µs
= 56µs


Throughput = 1/(check + compute) = (1/6µs) = 166,666 req/s

CPU Utilization = 5/6 ≈ 83%

The benefit of polling is that we avoid the overhead of the interrupt handler.
However, we assumed there is other work that can be done: DMA + polling doesn't make sense for a single-tasking system!

The performance calculations also assume the scheduling cost is 0. Let's say scheduling takes 1µs or 2µs.
In this case, throughput and utilization are lower but it's still better than before.
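For example, assuming one schedule() call per request (an assumption, not a number from the lecture) with a 2 µs scheduling cost:

Throughput = 1/(schedule + check + compute) = 1/(2µs + 1µs + 5µs) = 125,000 req/s
CPU Utilization = 5/8 ≈ 62%

which is still well ahead of DMA + interrupt (about 91,000 req/s and 45%).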

Polling also assumes that you and all the other processes are well-behaved. We can't assume all processes are well-behaved (that only holds in embedded environments). Some other process may be running when the disk finishes, and if it is not well-behaved it may refuse to let other processes run for a long time. To keep a misbehaving process from hogging the CPU, prefer DMA + interrupts over DMA + polling.

Performance Summary
Method            Latency (µs)   Throughput (x1000 req/s)   CPU Utilization (%)
Naive             100            10                         5
Batching          ~1000          21                         10.5
Interrupts        106            ~18                        8.9
DMA + Interrupt   61             91                         45
DMA + Polling     56             167                        83
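As a quick sanity check (not part of the lecture), the short C program below recomputes the table from the primitive costs: 5 µs send, 50 µs device latency, 40 µs read, 5 µs compute, 5 µs interrupt handler, 1 µs ready check. "Period" here is the steady-state time between request completions, which is what throughput and utilization are measured against.

#include <stdio.h>

int main(void)
{
	/* period = steady-state time per request (us), once overlapping is accounted for */
	struct { const char *name; double latency, period; } m[] = {
		{ "Naive",           5 + 50 + 40 + 5,            5 + 50 + 40 + 5 },
		{ "Batching (21)",   5 + 50 + 21 * (40 + 5),    (5 + 50 + 21 * (40 + 5)) / 21.0 },
		{ "Interrupts",      5 + 50 + 5 + 1 + 40 + 5,    5 + 5 + 1 + 40 + 5 },
		{ "DMA + interrupt", 50 + 5 + 1 + 5,             5 + 1 + 5 },
		{ "DMA + polling",   50 + 1 + 5,                 1 + 5 },
	};
	for (int i = 0; i < 5; i++) {
		double throughput = 1e6 / m[i].period;        /* requests per second        */
		double utilization = 100.0 * 5 / m[i].period; /* 5 us of compute per period */
		printf("%-16s latency %6.0f us  throughput %7.0f req/s  util %5.1f%%\n",
		       m[i].name, m[i].latency, throughput, utilization);
	}
	return 0;
}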
Review of Disk Performance
[Figure: Typical Computer System Architecture]

[Figure: Disk Structure]

Disk accesses incur mechanical latencies from the read-write head. Suppose the data we want is on another track, so the head has to physically move.

The disk latency comes from:
1. track-to-track seek time
1a. accelerate the head to correct track
1b. decelerate it as it gets close to target track
2. rotational latency: wait for data to come under head
3. read/write data as it rotates under the head (copy it to/from controller's RAM)
4. transfer data from/to controller's RAM via bus to wherever (CPU, memory)
Typical Disk Performance
Seagate Barracuda ES.2 1 TB:
16 MiB cache
7200 rpm spindle speed (120 Hz), so a complete rotation takes 1/120 s ≈ 8.33 ms
4.16 ms average rotational latency (half a rotation; on average the data is halfway around the track from the head)
(worst case is a full rotation, about 8.33 ms; best case is 0 ms)
8.5 ms average read seek time
9.5 ms average write seek time
0.8 ms read track-to-track seek time
1.0 ms write track-to-track seek time
1287 Mb/s max internal transfer rate (reading off the platter into the controller's RAM)
116 MB/s (about 0.93 Gb/s) max sustained transfer rate (continuous reading with no seeks)
3 Gb/s external transfer rate (SAS)
12.5 W power consumption during typical operation
9.0 W power consumption when idle
1.2 million hours mean time between failures (MTBF) (an estimate based on experience with this and other disks)
0.73% annualized failure rate (AFR): the chance the drive fails during a full year of use
10^-15 nonrecoverable read failure rate (roughly one unreadable sector per 10^15 bits read)
How do we get high performance from a disk drive that looks like this?

Bottlenecks : 4.16 ms avg rotational latency
8.5 ms avg read seek
(9.5 ms avg write seek)

Can the seek and the rotation overlap, so that a read costs just the 8.5 ms seek?

Typically you have to add the two numbers, because they can't be done in parallel: you don't know where on the disk you will be reading until the request comes in.
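For a concrete feel for these numbers, consider a random 4 KiB read on this drive (a rough estimate, not from the lecture):

average read seek + average rotational latency + transfer
≈ 8.5 ms + 4.16 ms + (4 KiB / 116 MB/s ≈ 0.035 ms)
≈ 12.7 ms per random read

That is fewer than 80 random 4-KiB reads per second, roughly 0.3 MB/s, versus about 116 MB/s for purely sequential reads. Closing that gap is what the following ideas are about.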

Basic ideas for improving performance:
PREFETCHING
Instead of doing a sequential read only when there is a request for it, get more data from the device (into RAM cache) before the request comes in. This can be done in applications, in the operating system, and in the disk controller. Batching is an example of prefetching.

Downside: Prefetching can waste RAM or tie up the disk unnecessarily.
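As one concrete example of application-level prefetching (a sketch under assumptions, not from the lecture), POSIX systems let a program hint that it will soon need a range of a file; the file name and chunk size below are made up for illustration.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

enum { CHUNK = 1 << 20 };   /* process the file in 1 MiB chunks */

int main(void)
{
	int fd = open("bigfile.dat", O_RDONLY);   /* hypothetical input file */
	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	char *buf = malloc(CHUNK);
	if (!buf) {
		close(fd);
		return EXIT_FAILURE;
	}

	off_t off = 0;
	ssize_t n;
	while ((n = pread(fd, buf, CHUNK, off)) > 0) {
		/* Hint to the OS that we'll want the *next* chunk soon, so it can
		   start reading it into the page cache while we compute on this one.
		   It's only a hint and may be ignored. */
		posix_fadvise(fd, off + n, CHUNK, POSIX_FADV_WILLNEED);
		/* ... compute on buf[0 .. n-1] here ... */
		off += n;
	}
	free(buf);
	close(fd);
	return 0;
}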
SPECULATION
Even though there's been no request for an operation, let's do it now in hopes that it'll improve performance later. This work chews up some CPU, and if the request turns out to be unnecessary work, it may slow down other requests. Speculation can be done with idle resources to minimize the downside of performing the additional operation, such as moving the disk arm to the middle of an idle drive.

Speculation can also create opportunities for batching. If the system speculates that a read request for block n will be followed by read requests for blocks n+1 through n+8, those requests can be batched.

Downside: Speculation can hurt performance if the work turns out to be unnecessary.
BATCHING
We already discussed batching for read requests, but is it possible to do batching for write requests?

It is possible to cache what you want to write in the kernel and wait until the buffer fills up before writing to disk. However, until then the data is not actually on persistent storage! If there is a crash, all of that data will be lost.

DALLYING
Instead of fulfilling a request immediately, wait a dallying interval of 10 ms or 100 ms after it arrives before acting on it. There are two reasons for this delay. First, a second request may arrive a few milliseconds later that overwrites the results of the first, making the first unnecessary. Second, more requests may come in that can be batched with it. This increases the latency of some requests but improves the average latency of all requests.
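As a tiny user-level sketch of write batching plus dallying (assumed, not from the lecture), the helper below accumulates writes in a RAM buffer and only pushes them to the file when the buffer fills or a dally interval expires. The buffer size and one-second interval are arbitrary, and the durability caveat above still applies: anything sitting in the buffer is lost on a crash.

#include <stdio.h>
#include <string.h>
#include <time.h>

enum { BUFCAP = 4096 };          /* batch size, chosen arbitrarily */
static char wbuf[BUFCAP];
static size_t wlen = 0;
static time_t last_flush = 0;

static void flush_buffer(FILE *out)
{
	if (wlen > 0) {
		fwrite(wbuf, 1, wlen, out);
		fflush(out);     /* note: still not fsync(); the kernel may dally too */
		wlen = 0;
	}
	last_flush = time(NULL);
}

static void buffered_write(FILE *out, const char *data, size_t n)
{
	if (n > BUFCAP) {                    /* oversized write: bypass the buffer */
		flush_buffer(out);
		fwrite(data, 1, n, out);
		return;
	}
	if (wlen + n > BUFCAP)               /* batch is full: write it out */
		flush_buffer(out);
	memcpy(wbuf + wlen, data, n);        /* dally: just remember it for now */
	wlen += n;
	if (time(NULL) - last_flush >= 1)    /* dally interval expired (1 s here) */
		flush_buffer(out);
}

int main(void)
{
	FILE *out = fopen("log.txt", "w");   /* hypothetical output file */
	if (!out)
		return 1;
	for (int i = 0; i < 100000; i++)
		buffered_write(out, "hello\n", 6);
	flush_buffer(out);                   /* final flush before exit */
	fclose(out);
	return 0;
}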
Assumptions
To justify speculation, we assume two properties of applications:
LOCALITY OF REFERENCE
We tend to access nearby data, either in space or in time.
Spatial locality: accessing the ith block means we are likely to access the (i + δ)th block, where δ is small.
Temporal locality: an access at time t means another access is likely around time t + δ.

However, average seek time assumes no locality of reference.
CACHE COHERENCE
In general, when multiple copies of shared data are stored in caches, we assume our caches of disk data are accurate and that changes will be reflected properly in all copies. When we try to improve performance, we are assuming cache-coherent processors that always read the most recent value written by any processor. This gets harder as the number of CPUs grows.

There is no perfect solution for cache coherence right now. Two common mechanisms are snooping and directory-based coherence, but both have shortcomings: snooping does not scale because it needs too much bus bandwidth, while directory-based coherence has longer latency.
Scheduling
Another problem that affects performance is scheduling accesses to the disk.
Assume there are a lot of pending I/O requests for a disk; which one should we do next?

Simple Model: 0 _ _ _ _ _ _ N-1 blocks
Head Position: 0 <= h < N

Cost of accessing block i is |i-h|.
Possible Scheduling Policies:

First-Come, First-Served (FCFS)
Shortest Seek Time First (SSTF)

We will discuss the advantages and disadvantages of these policies in the next lecture.
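To make the cost model concrete, here is a small sketch (assumed, not from the lecture) that computes the total |i - h| cost of serving an example request queue under FCFS and under SSTF; the queue contents and starting head position are made up.

#include <stdio.h>
#include <stdlib.h>

static long cost(long a, long b) { return labs(a - b); }

static long fcfs(long h, const long *req, int n)   /* serve in arrival order */
{
	long total = 0;
	for (int i = 0; i < n; i++) {
		total += cost(h, req[i]);
		h = req[i];
	}
	return total;
}

static long sstf(long h, long *req, int n)   /* always pick the closest; modifies req[] */
{
	long total = 0;
	for (int left = n; left > 0; left--) {
		int best = 0;
		for (int i = 1; i < left; i++)
			if (cost(h, req[i]) < cost(h, req[best]))
				best = i;
		total += cost(h, req[best]);
		h = req[best];
		req[best] = req[left - 1];           /* remove the chosen request */
	}
	return total;
}

int main(void)
{
	long a[] = { 98, 183, 37, 122, 14, 124, 65, 67 };   /* example request queue */
	long b[] = { 98, 183, 37, 122, 14, 124, 65, 67 };
	printf("FCFS cost: %ld\n", fcfs(53, a, 8));   /* head starts at block 53 */
	printf("SSTF cost: %ld\n", sstf(53, b, 8));
	return 0;
}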



References:
http://dis-dpcs.wikispaces.com/6.5.5+DMA+vs.+interrupts+vs.+polling
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/IO/polling.html
http://www.cs.toronto.edu/~demke/469F.06/Lectures/Lecture6.pdf
Principles of Computer System Design by Jerome H. Saltzer and M. Frans Kaashoek