CS111 - Scribe Notes for 05/07/2008

by Matthew Ball, Marc Nguyen, and Kresimir Petrinec

Hard Drive Access

In this lecture, we will deal with the time delays introduced by accessing a physical hard drive. For this lecture, we will assume the following specifications:

Capacity: 120-300 GB
Rotational Speed: 7200 rpm (120 Hz)
Average Latency: 4.20 ms
Rotation Time: 8.33 ms
Full-Stroke Seek [1]: 21.0 ms
Track-to-Track Seek: 2.0 ms
Average Read-Seek Time [2]: 8.9 ms
Average Write-Seek Time [3]: 10.9 ms
Transfer Rate (Buffer to Host): 3 Gb/s
Transfer Rate (Buffer to Disk): 0.97 Gb/s
Power Dissipation: ~8 W
  1. Time to go from the innermost track to the outermost track
  2. Average time to seek from one randomly chosen track to another
  3. Write-seeks must be more accurate than read-seeks, which makes them slower

Smaller blocks are more flexible, but are slower when dealing with large chunks of data. Conversely, larger blocks are more efficient for big chunks, but are prone to wasting a lot of space (e.g. storing a Social Security number in a 1 GB block). A typical block size today is 8 KiB. Therefore:

Read Time = 8192 B / ((0.97 × 10^9 b/s) / (8 b/B)) ≈ 0.07 ms

                  seek      latency    read
Read-Seek Time = 8.9 ms + 4.20 ms + 0.07 ms ≈ 13.2 ms (= 13.2 million cycles on a 1 GHz CPU)
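
To double-check the arithmetic, here is a small C program of our own (not from the lecture) that recomputes these figures from the drive specifications above:

#include <stdio.h>

int main(void) {
	double seek = 8.9e-3, latency = 4.2e-3;	// average read-seek, rotational latency (s)
	double rate = 0.97e9 / 8;		// buffer-to-disk rate, converted to bytes/s
	double xfer = 8192 / rate;		// time to transfer one 8 KiB block (s)
	printf("transfer:  %.2f ms\n", xfer * 1e3);			// ~0.07 ms
	printf("read-seek: %.1f ms\n", (seek + latency + xfer) * 1e3);	// ~13.2 ms
	return 0;
}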
	

Our Goal

To improve the apparent performance of I/O by exploiting locality of reference (the phenomenon where memory access tends to be clustered in both time and space):

Spatial Locality
A kind of locality of reference in which the reference string contains clusters of references to adjacent or nearby addresses. (i.e. If block i is accessed, then it is likely that block i+1 will be accessed)
Temporal Locality
A kind of locality of reference in which the reference string contains closely-spaced references to the same address. (i.e. If block i is accessed at time t, then access to block i is likely at time t+1)

As a running example, we will use the following loop to test hard drive access times:


// In real code, we would check the return status of each operation
// Assume the blocks being accessed are at random locations on disk
for(;;) {
	char buf[8192]; 			// 8 KiB buffer
	read(ifd, buf, sizeof buf);		// read --> 		13.2 ms
	// With compute, want to exploit locality of reference
	compute(buf);				// encryption --> 	1.0 ms (example)
	write(ofd, buf, sizeof buf);		// write --> 		15.2 ms
									// -------
									// 29.4 ms/iteration
}


// 1/0.0294 s ≈ 34 iterations/s
// 34 iterations/s × 8 KiB ≈ 272 KiB/s throughput

Bottlenecks

Per iteration of the loop above:

Seek Time: 19.8 ms (8.9 ms read-seek + 10.9 ms write-seek)
Rotational Latency: 8.4 ms (2 × 4.20 ms)
Transfer: 0.1 ms (2 × 0.07 ms)
Compute: 1.0 ms

General Techniques for Exploiting Locality

When Reading

Speculation
A technique to improve performance by performing an operation in advance of receiving a request, on the chance that it will be requested. The hope is that the result can be delivered with less latency and with less setup overhead. Examples include demand paging with larger pages than strictly necessary, prepaging, prefetching, and writing dirty pages before the primary device space is needed. (i.e. Guess where reads will occur, and do them ahead of time while they're cheap. A real-life example of speculation is filling up your gas tank: rather than waiting until your tank is completely empty (leaving you stranded and forcing you to walk to the nearest gas station), you can use speculation and fill up ahead of time, before the gas is actually needed.)
Batching
A technique to improve performance by combining several operations into a single operation to reduce setup overhead. (i.e. A few big chunks rather than lots of little chunks. A real-life example of batching is baking one large cake as opposed to multiple small cupcakes.)

Batching + Prefetching
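
A minimal sketch of the idea in code, reusing ifd, ofd, and compute() from the loop above; the 1 MiB batch size is our assumption. One big read pays a single seek and rotational latency, and it prefetches blocks long before the computation asks for them:

#define BLOCK 8192			// 8 KiB block
#define BATCH 128			// 128 blocks = 1 MiB per read

char big[BATCH * BLOCK];

for (;;) {
	read(ifd, big, sizeof big);	// one seek + latency amortized over 128 blocks
	for (int i = 0; i < BATCH; i++) {
		compute(big + i * BLOCK);		// still 1.0 ms per block
		write(ofd, big + i * BLOCK, BLOCK);	// writes not yet improved
	}
}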

With this change, the 8.9 ms seek and 4.2 ms rotational latency of each read are amortized over 128 blocks, so each iteration costs roughly the write (15.2 ms) plus the compute (1.0 ms), about 16.4 ms in total: roughly 61 iterations/s ≈ 490 KiB/s. As you can see, using batching and prefetching, we were able to almost double our performance!

When Writing

Dallying
A technique to improve performance by delaying a request on the chance that the operation won't be needed, or to create more opportunities for batching. (i.e. Delaying work to be done to do it more cheaply later, or not at all)
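
A minimal sketch of dallying applied to our writes; the helper name dally_write, the cache size, and the flush-when-full policy are our assumptions. Each write only copies into a RAM cache and returns immediately; the cached blocks reach the disk in one big batched write when the cache fills:

#include <string.h>
#include <unistd.h>

#define BLOCK 8192
#define CACHE_BLOCKS 128		// up to 1 MiB of writes dally in RAM

static char cache[CACHE_BLOCKS * BLOCK];
static int cached;			// blocks currently waiting in RAM

void dally_write(int fd, const char *buf) {
	memcpy(cache + cached * BLOCK, buf, BLOCK);	// a "write" costs only a memory copy
	if (++cached == CACHE_BLOCKS) {			// cache full:
		write(fd, cache, sizeof cache);		// one seek + latency for 128 blocks
		cached = 0;
	}
}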

With the writes dallied and batched as well, the seeks and rotational latencies almost disappear: each iteration costs roughly the compute (1.0 ms) plus the amortized transfer times (about 0.3 ms), around 1.3 ms in total, or roughly 780 iterations/s ≈ 6 MiB/s. As you can see, the combination of prefetching, batching, and dallying allowed us to increase our performance by a total factor of 23!

Another Method: Multitasking


// Original (polling) version
ssize_t read(...) {
	// 0x1f7 is the IDE status register: wait until the disk is
	// ready, i.e. BSY (0x80) is clear and DRDY (0x40) is set
	while ((inb(0x1f7) & 0xC0) != 0x40)
		continue;
}

// Multitasking version
ssize_t read(...) {
	while ((inb(0x1f7) & 0xC0) != 0x40)
		schedule(); // switch to a different thread while the disk is busy
}

By allowing other threads to run while the disk is busy, we can drastically improve performance!

What Can Go Wrong?

  1. Applications truly want random access to the disk, making it very difficult to exploit locality of reference.
  2. You lose power (or the system crashes). If the OS has dallied and already told the application that the write succeeded, the disk drive can't save you, since the OS hasn't told the disk drive anything yet: any data that was meant to be written is lost. Some possible solutions:
    1. Don't dally. (Correct, but slow.)
    2. Power failure trap (not usually practical)
    3. Battery backup, so that power "never" fails (as with NVRAM)
    4. Add new system calls so the application can tell the OS when a write matters (see the sketch after this list):
      • sync(); // schedules a flush of every buffer (doesn't wait for it to finish)
      • fsync(fd); // immediately flushes every buffer (data and metadata) associated with that file descriptor
      • fdatasync(fd); // immediately flushes fd's data buffers, ignoring metadata such as timestamps
      • Note: fsync() and fdatasync() touch only one file, so they are cheaper than sync(), and fdatasync() is cheaper still since it skips the metadata. These calls are heavily used by database systems (DBMS)
  3. Multiple processes accessing the same file: Cache Coherence Problem
[Diagram: the cache coherence problem]
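
A minimal sketch of solution 4 from item 2 above; the function name, file name, and record format are hypothetical. The point is not to report success until fsync() says the data has reached the disk:

#include <fcntl.h>
#include <unistd.h>

int save_record(const char *buf, size_t len) {
	int fd = open("db.log", O_WRONLY | O_APPEND | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0) {
		close(fd);
		return -1;	// data may not be on disk; don't claim success
	}
	return close(fd);	// only now is it safe to report success
}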

This is not just a kernel problem! The C library does its own buffering:

#include <stdio.h>
getchar	<-- uses a buffer...
putchar	<-- ...that lives inside the application
fflush	<-- please sync the app's buffers with the OS
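
For instance, a minimal sketch of our own: copy stdin to stdout through stdio's in-app buffer, then push that buffer to the OS with fflush() and the OS's copy to the disk with fsync():

#include <stdio.h>
#include <unistd.h>

int main(void) {
	int c;
	while ((c = getchar()) != EOF)
		putchar(c);		// dallies in stdio's in-app buffer
	if (fflush(stdout) != 0)	// app buffer --> kernel buffer
		return 1;
	fsync(fileno(stdout));		// kernel buffer --> disk (when stdout is a file)
	return 0;
}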

Flash Flush

Using flash-based memory brings some benefits and problems:

  1. Random access is cheap (seek time is roughly 0): there is no seek and no rotational latency.
  2. Flash wears out faster than a hard disk: each cell survives only a limited number of write/erase cycles.
  3. You must erase before writing, and erasing is very slow.