Lecture 10: File Systems

Authors: Brandon Downs, Yunzhong He, Kyoseong Ku, Zhuo Li


A file system is a data structure that lives both in memory and on secondary storage devices.
When designing a data structure that lives in several different places, we ideally want good performance from the file system.
How do we measure "good"? How can we tell if a file system is performing well?

Metrics for performance

The way we measure a file system's performance is by looking at its latency/response time, throughput, utilization, and capacity.

Latency is the difference between the request time and the response time, which we would like to minimize.

Throughput is the rate of request completion, or how many requests are satisfied per second. Usually throughput measures something meaningful, such as the number of I/O operations satisfied per second.

Utilization is the fraction of time the underlying system spends doing useful work, which is related to throughput. We want utilization as close to 100% as possible.

Capacity is how much data the file system can hold. We can give a file system more apparent space than its physical capacity by using compression; for example, we can store 2TB of data on a 1TB disk.

Latency and throughput are the most important, and there are two other low-level topics to consider: fragmentation and the transfer rate from one part of the file system to another. Reducing fragmentation is ideal, but it is considered low level because how the files are scattered across the disk is irrelevant to the user; what the user cares about is that the files exist and are accessible. Similarly, the transfer rate from one part of the file system to another is low level because what the user cares about is how much data is arriving at the user level.

Low latency and high throughput are desired.

Techniques to improve latency and throughput

Assume a device with the following characteristics:

  • CPU cycle time of 1 ns
  • Programmed I/O (inb/outb operations) takes 1 microsecond
    • When sending a command, need to issue several outb instructions
  • Device latency = 50 microseconds
    • When sending a command to the device, assume 5 programmed I/O instructions, so 5 microseconds
  • Computation per buffer = 5 microseconds = 5000 cycles = 5000 instructions
  • Read 40 bytes from the device (inb instructions) = 40 programmed I/O operations = 40 microseconds
			for(;;) {
			  char buf[40];
			  read 40 bytes in buf;
			  compute(buf);
			}
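
As a rough sketch (not from the lecture) of what "read 40 bytes in buf" might look like with programmed I/O on x86: the port numbers, command byte, and helper names below are made up for illustration, and each inb/outb is assumed to cost about 1 microsecond, as stated above.

			#include <stdint.h>

			#define DEV_CMD_PORT  0x1f7   /* made-up command-register port */
			#define DEV_DATA_PORT 0x1f0   /* made-up data-register port */

			static inline void outb(uint16_t port, uint8_t val) {
			  __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
			}

			static inline uint8_t inb(uint16_t port) {
			  uint8_t val;
			  __asm__ volatile ("inb %1, %0" : "=a"(val) : "Nd"(port));
			  return val;
			}

			/* One iteration's "read 40 bytes in buf". */
			void read_40_bytes(char buf[40]) {
			  outb(DEV_CMD_PORT, 0x20);        /* ~5 outb's send the command; */
			  /* ...the remaining outb's (sector number, count, ...) elided */
			  for (int i = 0; i < 40; i++)     /* 40 inb's ~= 40 microsec */
			    buf[i] = (char) inb(DEV_DATA_PORT);
			}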

Latency = send command (5 microsec) + device (50 microsec) + read (40 microsec) + compute (5 microsec) = 100 microsec for each loop

Throughput = 1/latency = 10,000 requests per second

CPU utilization = (5 microsec compute operations) / (100 microsec for each loop) = 0.05 or 5%

Improve performance by increasing buffer size and batching

Batching

Instead of reading only 40 bytes at a time, read multiple buffers at once:

			enum { BATCHSIZE = 21 };  // read 21 buffers at a time
			for (;;) {
			  char buf[40*BATCHSIZE];
			  read 40*BATCHSIZE bytes into buf;
			  for (int i = 0; i < BATCHSIZE; i++)
			    compute(buf + 40*i);
			}

Latency = send command (5 microsec) + device (50 microsec) + read (40*21 = 840 microsec) + compute (5*21 = 105 microsec) = 1000 microsec for the whole batch

The 1000 microsec is for the entire batch, so the average per-buffer latency is 1000/21, about 48 microsec, which is less than before.

Throughput: for the batched version, throughput is not simply 1/latency.

Each request is 21 buffers' worth, and we complete 1000 such batches per second; but the real requests are 40 bytes each, so throughput is 21 * 1000 = 21,000 40-byte req/s.

CPU utilization = 105/1000 = 0.105 or 10.5%

According to the computation above, overall performance improves if we consider CPU utilization and throughput, but the downside is that latency goes up. Can we do better than batching? Yes, by using interrupts.

Interrupts

 The old way:

			for(;;) {
			  write a command to device;
			  while(!ready)
			    continue;
			  read data from device;
			  compute();
			}

The while loop is a bottleneck: while sitting around busy-waiting, we should or could be getting some work done. Since we can't compute on data that we haven't read yet, we can compute on the previous buffer (data that has already been read) while we wait for the next I/O to complete.

Replace the while statement above with this

			do {
			  wait for interrupt;  // CPU just stalls here
			  handle interrupt;
			} while (!ready);
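
A minimal sketch of the overlap idea (compute on the previous buffer while the device fills the next one) using double-buffering. issue_read(), wait_for_interrupt(), and compute() are hypothetical helpers standing in for the pseudocode, not functions from the lecture.

			void issue_read(char *buf);        /* hypothetical: send the read command */
			void wait_for_interrupt(void);     /* hypothetical: block until completion interrupt */
			void compute(char *buf);           /* the 5-microsec computation from above */

			void overlapped_loop(void) {
			  static char buf[2][40];
			  int cur = 0;

			  issue_read(buf[cur]);            /* start the first read */
			  for (;;) {
			    wait_for_interrupt();          /* device says buf[cur] is full */
			    int prev = cur;
			    cur = 1 - cur;
			    issue_read(buf[cur]);          /* kick off the next read immediately... */
			    compute(buf[prev]);            /* ...and compute while it is in flight */
			  }
			}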

Latency = send command (5 microsec) + wait for interrupt block (50 microsec) + overhead of handling interrupt (5 microsec) + check ready (1 microsec) + read (40 microsec) + compute (5 microsec) = 106 microsec

Throughput = 1/(56 microsec) = 17,857 req/s. (Only 56 of the 106 microsec need this request's attention; the 50 microsec spent waiting for the interrupt can overlap with work on other buffers, so a request can complete every 56 microsec.)

CPU utilization = 5/56 = 8.9%

In order to further improve the performance, we can also combine interrupts and batching.

Direct memory access (DMA)

The device controller sends data to memory directly.

			for(;;) {
			  issue a write command to the device;
			  block until interrupt;
			  handle interrupt;
			  check that device is ready;
			  compute;
			}
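
A sketch with an entirely hypothetical device, to show what "issue a command" and "the controller sends data to memory directly" might look like; the register layout and names below are invented for illustration.

			#include <stdint.h>

			struct dma_regs {                 /* hypothetical memory-mapped registers */
			  volatile uint64_t buf_addr;     /* physical address of the destination buffer */
			  volatile uint32_t nbytes;       /* number of bytes to transfer */
			  volatile uint32_t command;      /* writing 1 starts a read into buf_addr */
			  volatile uint32_t status;       /* becomes nonzero when the transfer is done */
			};

			void dma_start_read(volatile struct dma_regs *dev, void *buf, uint32_t n) {
			  dev->buf_addr = (uint64_t)(uintptr_t) buf;  /* real code needs a physical address */
			  dev->nbytes   = n;
			  dev->command  = 1;              /* controller copies the data on its own;
			                                     the CPU is free until the completion interrupt */
			}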

Latency = block (50 microsec) + interrupt handler (5 microsec) + check (1 microsec) + compute (5 microsec) = 61 microsec

Throughput = 1/(11 microsec) = 91,000 req/s. (Per request the CPU spends only 5 + 1 + 5 = 11 microsec on the interrupt handler, the check, and the computation; the 50 microsec DMA transfer overlaps with work on other requests.)

CPU utilization = 5/11 = 45%

Now, we cannot compute any faster, so to improve performance further we need to get rid of the interrupt handler overhead.

DMA + polling

Running program flat out as fast as possible

			while(DMA slots not ready)
			  schedule();

Insert these two lines in place of "block until interrupt" in the DMA loop above. Giving the CPU to another process lets it do something useful and avoids the overhead of the interrupt handler. The idea of DMA + polling is to keep the CPU busy doing useful work; if this process has nothing useful to do, hand control to other processes. The calculations below assume schedule() costs 0; even if we charge, say, 2 microsec per schedule() call, DMA + polling still comes out ahead.
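
Putting the pieces together, a sketch of the combined DMA + polling loop, reusing the hypothetical dma_regs/dma_start_read() from the DMA sketch above; schedule() and compute() are stand-ins for the scheduler and the per-buffer computation.

			void schedule(void);                     /* hypothetical: yield the CPU to another process */
			void compute(char *buf);                 /* the 5-microsec per-buffer computation */

			void dma_polling_loop(volatile struct dma_regs *dev) {
			  static char buf[40];
			  for (;;) {
			    dev->status = 0;                     /* clear the hypothetical completion flag */
			    dma_start_read(dev, buf, 40);        /* issue the command; the DMA proceeds alone */
			    while (!dev->status)                 /* no interrupt handler: just poll... */
			      schedule();                        /* ...yielding the CPU to useful work */
			    compute(buf);
			  }
			}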

Latency = block (50 microsec) + check (1 microsec) + compute (5 microsec) = 56 microsec

Throughput = 1/(6 microsec) = 166,667 req/s. (Per request the CPU spends only 1 microsec checking + 5 microsec computing; the 50 microsec DMA transfer overlaps with other work.)

CPU utilization = 84%

METHOD             LATENCY (microsec)   THROUGHPUT (1000s of 40-byte req/s)   CPU UTILIZATION (%)
Naive              100                  10                                    5
Batching           ~1000                21                                    10.5
Interrupts         106                  ~18                                   8.9
DMA + interrupts   61                   91                                    45
DMA + polling      56                   167                                   84

If polling is used, we assume that all processes are well-behaved.

Linux-like systems use interrupts because we can't assume that all processes are well-behaved.

The two main devices used for I/O are disk and flash.

Typical disk performance

[Image: hard disk drive]

  • Device = Seagate Barracuda ES2 1TB
  • Actual run-of-the-mill device
  • These tend to be used in storage farms
  • These are "nearline storage" products, not intended to be super high performance, but the next level down
  • Controller has 16 MiB cache
  • 4.16 ms average rotational latency
  • 7200rpm drive = 120Hz
    • The amount of time it takes to do 1 rotation = 8.33 ms
    • About twice the average latency
    • Average rotational latency assumes the desired data is half a rotation away from the read head
  • Read seek time 8.5 ms average
  • Write seek time 9.5 ms average
    • The two differ because write is harder than read
      • The write heads need to be positioned more accurately than read heads
  • Track-to-track 0.8 ms read, 1.0 ms write
  • 1287 Mb/s (megabits) max internal transfer read
  • 116 MB/s (megabytes) max sustained transfer rate (0.93 Gb/s)
    • Sustained means if we keep reading
  • 3 Gb/s external transfer rate (SAS)
    • Can't really get data this fast from disk, maybe in short bursts. Actual is 0.93 Gb/s
  • 12.5 Watts for typical operations, 9W when idle
  • 1.2 million hours mean time before failure (MTBF)
    • If the drive runs 24/7, there is roughly an AFR chance that it will fail within a year
  • 0.73% AFR = annualized failure rate
    • When we write data to disk and later read it back, a 1-bit error can be recovered by the drive's error-correcting codes, which return the correct data
    • If there are enough errors on the disk, the error-correcting codes won't be able to correct them

How to get a high-performance file system on disk

The most bothersome figures are the 4.16 ms average rotational latency and the 8.5 ms average read seek time. Can we worry only about the 8.5 ms value, on the grounds that the platter can rotate into position while the head is seeking? No: it is hard to overlap the two operations because we don't know ahead of time where we will need to read next. The two must be added, giving approximately 13 ms per random access.

Basic ideas for improving performance

Prefetching is getting data from the device (into a RAM cache) before the application requests it: if the application is reading sequentially, assume subsequent sectors will also be read and move them into the cache. This can be done in the application (similar to batching), in the OS, or in the disk controller. Prefetching can hurt performance by wasting RAM and by tying up the disk fetching data that turns out not to be needed.
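
As one concrete way an application can ask for prefetching on a POSIX system, a minimal sketch; the file name and the 1 MiB range are made up for illustration.

			#include <fcntl.h>
			#include <stdio.h>
			#include <unistd.h>

			int main(void) {
			  int fd = open("data.bin", O_RDONLY);   /* hypothetical input file */
			  if (fd < 0) { perror("open"); return 1; }

			  /* Ask the kernel to start reading the first 1 MiB into its cache now,
			     so later read() calls on that range are likely to hit RAM. */
			  posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_WILLNEED);

			  /* ... do other work; then read() the prefetched range ... */
			  close(fd);
			  return 0;
			}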

Speculation is a generalization of prefetching: guess what future requests will look like and do extra work now to satisfy them. Speculation can sometimes hurt performance.

Batching (for reads) is a common use of prefetching.

Batching (for writes) means caching what you want to write in the kernel, not storing it on disk yet, and waiting until the buffer fills before writing it out. The downside is that the data is not actually written yet, so it is not in persistent storage.

Dallying is waiting in the hope that a similar request arrives: don't write immediately, wait for a dallying interval of 10 ms or so for another write command, and then execute both together. The downside, again, is that the data is not actually written right away, so a crash can lose the last 10 ms or so of data.
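
A minimal sketch of user-level write batching plus dallying; the 4 KiB buffer size, the helper names, and calling dally_flush() after roughly 10 ms of quiet are illustrative assumptions, not part of the lecture.

			#include <string.h>
			#include <unistd.h>

			enum { WBUF_SIZE = 4096 };
			static char wbuf[WBUF_SIZE];
			static size_t wlen;

			/* Batching: append to the in-memory buffer and only hit the disk when
			   it fills up.  Assumes n <= WBUF_SIZE. */
			void buffered_write(int fd, const char *data, size_t n) {
			  if (wlen + n > WBUF_SIZE) {
			    write(fd, wbuf, wlen);               /* buffer full: write it out now */
			    wlen = 0;
			  }
			  memcpy(wbuf + wlen, data, n);          /* until flushed, this data can be lost */
			  wlen += n;
			}

			/* Dallying: called after ~10 ms with no further writes, flush what we have. */
			void dally_flush(int fd) {
			  if (wlen > 0)
			    write(fd, wbuf, wlen);
			  wlen = 0;
			}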

  • Properties of application assumed in order to do speculation:
    • Locality of reference = when data is accessed, applications tend to access other data nearby
      • Spatial locality = if i-th block is accessed, (i+delta)th block is likely going to be accessed, where delta is small
      • Temporal locality = if you access a block at time t, there's a chance that you'll access the block again at time t+delta, delta small
      • The average seek time value does not consider locality of reference
    • Cache coherence = our cache of disk data is accurate
      • Assumption gets harder to maintain as number of CPUs grows

Scheduling access to disk

Now, we assume there are lots of IO requests to a disk, and we have to decide which request to do next. Simple model is a disk is a collection of blocks 0 to n-1, each block is a disk sector that current head position is somewhere in the middle of disk (0 <= h < n), cost of accessing block i is the distance abs(i-h), can do first come first serve FCFS, and can also do shortest seek time first SSTF.