CS 111 Winter 2012
Lecture 10: File System Performance
by Daniel Ichwan and Seungdo Lee
for a lecture by Professor Paul Eggert on February 15, 2012 at UCLA
Table of Contents
- Storage Devices
- Disk Drive
- Solid State Device (SSD)
- Hybrid Device
- Random Read/Write Performance
- Improving Read/Write Performance
- Sequential Read/Write Performance
- Batching
- Multitasking
- Direct Memory Access (DMA)
- Further Read/Write Improvements
Storage Devices
In this lecture, we cover the file system and its performance. To understand the performance of the overall system, we first look at its low-level building blocks: the secondary storage devices. We consider two different types of storage devices: the disk drive and flash memory (the SSD).
| | Seagate Barracuda XT 3TB 7200RPM Hard Drive | Corsair Force 60GB SSD |
|---|---|---|
| Read | Seek time: 8.5ms | Read speed: up to 280MB/s (sequential) |
| Write | Seek time: 9.5ms | Write speed: up to 270MB/s (sequential) |
| Transfer rates | SATA transfer rate to bus/cache: 6 Gb/s; sustained data transfer rate: 130 Mb/s | Random data transfer rate: 700 Mb/s |
| Power consumption | 9.23W (operating), 6.39W (idle) | 2.0W (max operating), 0.5W (idle) |
| Operating shock | 63 Gs for 2ms | 1000 Gs for 2ms |
| Non-operating shock | 300 Gs for 2ms | 1000 Gs for 2ms |
| Price per GB | $290/3TB = $0.10/GB | $143/60GB = $2.40/GB |
(A) Disk Drive
(B) Solid State Device (SSD)
SSDs act like disk drives but are internally based on NAND flash technology.
- Flash memory has slower random reads/writes than sequential ones.
- Flash memory wears out if the same location is written repeatedly, so one shouldn't write to the same block over and over again.
This is possibly why writes are slower than reads on an SSD: the write location must first be determined by a wear-leveling algorithm.
Wear-Leveling Algorithms
These algorithms spread writes evenly across blocks so that no single block wears out prematurely. They can be implemented either in (A) the device controller or (B) the OS/CPU (a sketch of one such policy follows this list).
- (A) Device Controller
If the wear-leveling algorithm lives in the device controller, it is independent of OS scheduling, and both a disk and an SSD simply look like a disk drive to the OS. But if the OS wants to use the real device efficiently, this abstraction has performance issues: disk-drive optimizations are wasted if the device is actually an SSD.
- (B) In OS/CPU
If the wear-leveling algorithm is implemented in the OS, the device can be identified as an SSD or a disk drive, so device-specific optimizations can be applied; however, the OS becomes a source of added complexity and bugs.
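As an illustration only (not code from the lecture), here is a minimal sketch of one possible wear-leveling policy in C: keep an erase count per block and always direct the next write to the least-worn free block. The block count and the helper names are hypothetical.

#include <limits.h>
#include <stdio.h>

#define NBLOCKS 1024  /* hypothetical number of flash erase blocks */

static unsigned erase_count[NBLOCKS];  /* how many times each block has been erased */
static char in_use[NBLOCKS];           /* nonzero if the block currently holds live data */

/* Pick the free block that has been erased the fewest times. */
static int pick_least_worn_block(void)
{
    int best = -1;
    unsigned best_count = UINT_MAX;
    for (int b = 0; b < NBLOCKS; b++)
        if (!in_use[b] && erase_count[b] < best_count) {
            best = b;
            best_count = erase_count[b];
        }
    return best;  /* -1 if nothing is free */
}

/* Logical write: rather than rewriting the same physical block over and over,
   erase and program the least-worn free block and release the old one. */
static int wear_leveled_write(int old_block)
{
    int b = pick_least_worn_block();
    if (b < 0)
        return -1;          /* a real flash translation layer would garbage-collect here */
    erase_count[b]++;       /* the erase-before-write cycle is what wears a block out */
    in_use[b] = 1;
    if (old_block >= 0)
        in_use[old_block] = 0;
    return b;               /* the caller updates its logical-to-physical map */
}

int main(void)
{
    int block = -1;
    for (int i = 0; i < 5; i++) {
        block = wear_leveled_write(block);   /* rewriting the "same" logical block */
        printf("write %d went to physical block %d\n", i, block);
    }
    return 0;
}

Running it shows successive rewrites of the "same" logical block landing on physical blocks 0, 1, 2, ..., so no single block absorbs all the wear.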
(C) Hybrid Storage Device
- If you want both the large, cheap capacity of a disk drive and the fast transfer rates of an SSD,
a "hybrid drive" can be created by attaching an SSD and a disk drive to a common device controller. The device controller
can initially write to the SSD and later, when the system is not busy, transfer the data from the SSD to the disk drive
(a rough sketch of this write path follows this list).
- In this setup, the SSD acts as a stable (persistent) cache for the hard disk, whereas RAM is an "unstable" cache because it loses its data if there
is power loss.
- But since hybrids are expensive, one can instead buy a group of disks and combine them as a RAID (redundant array of independent disks).
- Note: for pedagogical simplicity, the rest of this lecture assumes a single non-hybrid device.
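A rough sketch (again not from the lecture) of the write path such a controller might follow: acknowledge a write as soon as the SSD has it, and migrate the data to the disk when the device is idle. The ssd_write/disk_write functions are hypothetical stand-ins for controller firmware.

#include <string.h>

#define SECTOR      512
#define PENDING_MAX 128

/* Writes that have reached the SSD but not yet the disk. */
static struct { long sector; char data[SECTOR]; } pending[PENDING_MAX];
static int npending;

/* Hypothetical firmware hooks; a real controller would drive the hardware here. */
static void ssd_write(long sector, const char *buf)  { (void) sector; (void) buf; }
static void disk_write(long sector, const char *buf) { (void) sector; (void) buf; }

/* Host write: acknowledge as soon as the fast SSD has the data,
   and remember that the slow disk still needs a copy. */
void hybrid_write(long sector, const char *buf)
{
    ssd_write(sector, buf);
    if (npending < PENDING_MAX) {
        pending[npending].sector = sector;
        memcpy(pending[npending].data, buf, SECTOR);
        npending++;
    }
    /* If the pending table were full, a real controller would have to
       force a migration before acknowledging. */
}

/* Called when the device is otherwise idle: migrate SSD data to the disk. */
void hybrid_flush_when_idle(void)
{
    for (int i = 0; i < npending; i++)   /* oldest first, so a later write of the
                                            same sector ends up on the disk last */
        disk_write(pending[i].sector, pending[i].data);
    npending = 0;
}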
Random Read/Write Performance
0 char buf[8192]; // buffer: 16 sectors of 512 bytes
1 for (;;) {
2 lseek (ifd, random_location, SEEK_SET); // set the read offset to a random location
3 n = read(ifd, buf, sizeof buf); // ifd = input fd
4 n1 = compute(buf, n); // a generic process that will transform the buffer without changing its size
5 lseek(ofd, other_random_location, SEEK_SET); // set the write offset to a random location
6 write(ofd, buf, n1); // ofd = output fd
7 } // assume raw drive
For the given code above, we will assume a few things:
- Using a raw drive and independent disks (one for reads, one for writes)
- n = n1
- n = sizeof buf
- Application instructions: 0.01μseconds
- Syscall overhead: 1μsecond
- I/O instructions: 10μseconds
We can break down the latency of each step of the code above as follows:
- Line 2: overhead = 4.16ms average rotational latency + 8.5ms average seek time
- Line 3: 4.16ms latency + 10.11μseconds of read overhead
- Line 4: an average of 10ms of computation
- Line 5: write latency = read latency + 1ms (the average write seek is 9.5ms vs. 8.5ms)
Total latency per iteration of the random read/write loop

| Component | Cost |
|---|---|
| Latency (rotational) | 2 × 4.16ms = 8.32ms |
| Seek | 8.5ms + (8.5ms + 1ms) = 18ms |
| Syscall | 0.04404ms |
| Computation (using 8KB buffer) | 10ms |

Total = 8.32ms + 18ms + 0.04404ms + 10ms = 36.36404 ms per iteration
→ Bytes per second: 8192 bytes / 0.036364 s ≈ 225,000 bytes/second (This is terrible!)
- Average rotational latency: the average time for the requested sector to rotate under the head (from time of request to time of completed read). At 7200rpm a full rotation takes 8.33ms, so the average is about half of that: 4.16ms.
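For concreteness, here is a tiny program that simply recomputes the totals above from the stated assumptions; it adds no new data.

#include <stdio.h>

int main(void)
{
    /* Numbers taken from the breakdown above (all in milliseconds). */
    double rotational = 2 * 4.16;            /* one half-rotation each for the read and the write */
    double seek       = 8.5 + (8.5 + 1.0);   /* read seek + write seek (1 ms slower) */
    double syscall    = 0.04404;             /* syscall + I/O-instruction overhead for the 4 calls */
    double compute    = 10.0;                /* compute() on the 8 KB buffer */

    double per_iter_ms   = rotational + seek + syscall + compute;
    double bytes_per_sec = 8192 / (per_iter_ms / 1000.0);

    printf("per iteration: %.5f ms\n", per_iter_ms);        /* ~36.364 ms */
    printf("throughput:    %.0f bytes/s\n", bytes_per_sec); /* ~225,000 bytes/s */
    return 0;
}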
Improving Read/Write Performance
Random read/write performance is usually terrible. There are many ways to improve it; we discuss each of the following in turn:
- Sequential Read/Write Performance
- Batching
- Multitasking
- Direct Memory Access (DMA)
- Direct Memory Access (DMA) with POLLING
- Caching
- Pre-fetching
- Speculation
- Dallying Writes
(A) Sequential Read/Write
To minimize seek time, reads and writes should be sequential. This can be done by removing the lseek() calls, which
moved the file offset to random locations, so that the new code reads and writes sequentially:
char buf[8192];
for (;;) {
    n = read(ifd, buf, sizeof buf);
    n1 = compute(buf, n);
    write(ofd, buf, n1);
} // assume raw drive
(B) Batching
The concept of batching is quite simple: use larger blocks/buffers when reading and writing:
char buf[1024*1024]; // buffer: now 1MB (vs. 8KB)
for (;;) {
    n = read(ifd, buf, sizeof buf);
    n1 = compute(buf, n);
    write(ofd, buf, n1);
} // assume raw drive
Total latency per iteration of the batched read/write loop

| Component | Cost |
|---|---|
| Latency (rotational) | 2 × 4.16ms = 8.32ms |
| Seek (track-to-track) | 0.30ms |
| Syscall | 0.02202ms |
| Computation | 10ms per 8KiB × 128 (8KiB blocks per MiB) = 1280ms |

Total = 8.32ms + 0.30ms + 0.02202ms + 1280ms = 1288.64202 ms per iteration
→ Since each iteration processes 1 MiB → 1 MiB / 1288.64 ms ≈ 813,000 bytes/s
- Note that while larger blocks are generally better, if the block size is too big (e.g. 1GB), batching may become disadvantageous:
- The data no longer fits in the Level 1/Level 2 caches → communication is slower because the operation must go out to RAM
- More awkward track management - the large batch might have a read that ends at the end of a track and must therefore seek to another track
- Law of Diminishing Returns - past a certain point, larger blocks no longer reduce the latency per byte (see the calculation below)
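To see the diminishing returns concretely, here is a small calculation using the cost model from the batching table above (a fixed per-iteration overhead of about 8.64 ms plus 10 ms of compute per 8 KiB); the particular buffer sizes tried are arbitrary.

#include <stdio.h>

int main(void)
{
    /* Per-iteration cost model from the batching table above (milliseconds):
       fixed overhead (rotational latency + track-to-track seek + syscalls)
       plus 10 ms of compute() per 8 KiB of data. */
    double fixed_ms          = 8.32 + 0.30 + 0.02202;
    double compute_ms_per_8k = 10.0;

    long sizes[] = { 8L << 10, 64L << 10, 1L << 20, 1L << 30 };  /* 8 KiB ... 1 GiB */
    for (int i = 0; i < 4; i++) {
        double ms = fixed_ms + compute_ms_per_8k * (sizes[i] / 8192.0);
        printf("%10ld-byte buffer: %.0f bytes/s\n", sizes[i], sizes[i] / (ms / 1000.0));
    }
    return 0;
}

The throughput rises quickly at first but levels off near 8192 bytes / 10 ms ≈ 819,000 bytes/s, the compute-bound limit, so ever-larger buffers stop paying off.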
(C) Multitasking
While waiting for a read, schedule something else that can run at the same time (i.e., inside read's implementation: read(...) { ... schedule() ... }).
Multitasking can effectively hide the latency and seek time if scheduled properly, but it adds the penalty of context switches (a sketch of one way to overlap I/O with computation appears at the end of this subsection).
Total latency per iteration of the multitasked read/write loop
(using non-DMA I/O and 8KB buffers, vs. the 1MB batching buffer)

| Component | Cost |
|---|---|
| Latency (hidden by multitasking) | 0 |
| Seek (hidden by multitasking) | 0 |
| Syscall | 0.02202ms |
| Computation (using 8KB buffer) | 10ms |
| Context switches (multitasking) | 0.1ms/switch × 4 switches = 0.4ms |

Total = 0.02202ms + 10ms + 0.4ms = 10.42202 ms per iteration
→ Since each iteration processes 8 KiB → 8192 bytes / 10.42202 ms ≈ 786,000 bytes/s (a comparable rate to batching)
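The multitasking above happens inside the kernel, which schedules another process while one waits. As a loose user-level analogue (my own sketch, not code from the lecture), here is a double-buffered loop using POSIX asynchronous I/O (aio_read), so that the next read overlaps with compute() on the current buffer; the file names are placeholders.

#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ 8192

/* Stand-in for the lecture's compute(): any size-preserving transformation. */
static size_t compute(char *buf, size_t n) { (void) buf; return n; }

int main(void)
{
    int ifd = open("infile", O_RDONLY);                            /* hypothetical input file */
    int ofd = open("outfile", O_WRONLY | O_CREAT | O_TRUNC, 0666); /* hypothetical output file */
    if (ifd < 0 || ofd < 0)
        return 1;

    static char bufs[2][BUFSZ];
    off_t off = 0;
    int cur = 0;

    struct aiocb rd;                   /* start the first read asynchronously */
    memset(&rd, 0, sizeof rd);
    rd.aio_fildes = ifd;
    rd.aio_buf    = bufs[cur];
    rd.aio_nbytes = BUFSZ;
    rd.aio_offset = off;
    aio_read(&rd);

    for (;;) {
        /* Wait for the outstanding read into bufs[cur] to finish. */
        const struct aiocb *list[1] = { &rd };
        aio_suspend(list, 1, NULL);
        ssize_t n = aio_return(&rd);
        if (n <= 0)
            break;                     /* end of file or error */
        off += n;

        /* Immediately start reading the next chunk into the other buffer,
           so the disk is busy while the CPU runs compute(). */
        int nxt = 1 - cur;
        memset(&rd, 0, sizeof rd);
        rd.aio_fildes = ifd;
        rd.aio_buf    = bufs[nxt];
        rd.aio_nbytes = BUFSZ;
        rd.aio_offset = off;
        aio_read(&rd);

        /* Overlap: compute on and write out the chunk we already have. */
        size_t n1 = compute(bufs[cur], (size_t) n);
        write(ofd, bufs[cur], n1);
        cur = nxt;
    }

    close(ifd);
    close(ofd);
    return 0;
}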
(D) Direct Memory Access (DMA)
With DMA, we do not use the programmed I/O approach (see Lecture 2), in which the CPU copies data from the disk controller to memory; instead,
the disk controller talks directly to RAM. Non-DMA approaches add roughly 0.5ms of latency for this CPU mediation. The file system then acts
as a level of indirection between the application and the device controller; without it, going to the device for every tiny request is disastrous:
char c;
for (;;) {            // read/write one byte at a time
    read(ifd, &c, 1); // each read = seek + latency + etc. = ~14ms
    write(ofd, &c, 1);
}
(E) Direct Memory Access (DMA) with POLLING
If the DMA approach is used without context switching (i.e., polling only), a single process issues asynchronous I/O and spins until it completes. This eliminates
the latency of context switching.
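As a purely illustrative sketch (the register layout and addresses are made up, not a real device), the difference between the two styles looks roughly like this: with DMA the CPU only tells the controller where to put the data, and with polling it then spins on a status bit instead of blocking and paying for a context switch.

#include <stdint.h>

/* Hypothetical memory-mapped disk controller registers (made-up layout). */
struct disk_ctrl {
    volatile uint64_t dma_addr;   /* address the controller will copy the sector to */
    volatile uint32_t sector;     /* sector number to read */
    volatile uint32_t command;    /* writing 1 starts a read */
    volatile uint32_t status;     /* bit 0 is set when the transfer is complete */
};

#define CTRL ((struct disk_ctrl *) 0xFEED0000)   /* made-up device address */

/* DMA read with polling: the controller copies the sector straight into buf
   (no CPU copy loop as in programmed I/O), and the CPU busy-waits on the
   status bit rather than blocking and paying for a context switch. */
void dma_read_sector_polling(uint32_t sector, void *buf)
{
    CTRL->dma_addr = (uint64_t) (uintptr_t) buf;   /* a real driver would use a physical address */
    CTRL->sector   = sector;
    CTRL->command  = 1;                            /* start the transfer */
    while ((CTRL->status & 1) == 0)
        continue;                                  /* poll until the controller says it is done */
}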
(F) Caching
In caching, when a read is done, the entire sector is retrieved preemptively. So if more bytes are read from that sector later on, they may be quickly
retrieved from the cache (versus actually reading from the disk).
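A toy illustration of the idea (not the actual file system code; disk_read_sector is a hypothetical stub): keep recently read sectors in memory and satisfy repeat reads from there.

#include <string.h>

#define SECTOR 512
#define NCACHE 64   /* a small, direct-mapped cache of whole sectors */

static struct { long sector; char data[SECTOR]; char valid; } cache[NCACHE];

/* Hypothetical raw device read of one sector (stub for illustration). */
static void disk_read_sector(long sector, char *buf)
{
    (void) sector;
    memset(buf, 0, SECTOR);   /* real code would issue the actual device read */
}

/* Read one sector, going to the disk only on a cache miss. */
void cached_read_sector(long sector, char *buf)
{
    int slot = (int) (sector % NCACHE);
    if (!cache[slot].valid || cache[slot].sector != sector) {
        disk_read_sector(sector, cache[slot].data);   /* miss: fetch the whole sector */
        cache[slot].sector = sector;
        cache[slot].valid = 1;
    }
    memcpy(buf, cache[slot].data, SECTOR);            /* hit (or just filled): served from memory */
}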
(G) Pre-fetching
If the file system sees a read from a given sector, the next few sectors are read preemptively to reduce overhead. This approach is based on the principle of
spatial locality (a special case of speculation (see next section)), which predicts that data near recently fetched data will probably be used as well.
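The prefetching described above is done by the file system itself, but an application can ask for the same behavior through the real POSIX call posix_fadvise; a minimal sketch, with an example file name:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile", O_RDONLY);   /* example file name */
    if (fd < 0)
        return 1;

    /* Hint to the kernel that we will read sequentially, so the file system
       can prefetch (read ahead) the next blocks before we ask for them. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[8192];
    while (read(fd, buf, sizeof buf) > 0)
        ;   /* consume the data; later reads mostly hit prefetched blocks */

    close(fd);
    return 0;
}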
(H) Speculation
In speculation, the file system anticipates what the application will do/need next and retrieves the necessary data preemptively.
(I) Dallying Writes
In dallying writes, a write to the disk will instead be written to cache. The file system will eventually write this cached data to the disk when there is
CPU downtime. This dallying approach allows the file system/CPU to hold off on longer writes and instead return quickly and do more useful work. An added
side effect of dallying writes is that reads of the dallying data will be faster as it is stored in cache and does not have to be read from disk.
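A toy sketch of the idea (not the real file system code; disk_write_sector is a hypothetical stub): writes only dirty an in-memory copy and return immediately, the dirty sectors are pushed to disk later, and reads of dallied data are served from memory.

#include <string.h>

#define SECTOR 512
#define NBUF   64

/* In-memory copies of sectors; "dirty" means the disk copy is stale. */
static struct { long sector; char data[SECTOR]; char valid, dirty; } buf_cache[NBUF];

/* Hypothetical raw device write (stub for illustration). */
static void disk_write_sector(long sector, const char *data) { (void) sector; (void) data; }

/* Dallying write: update the cached copy and return at memory speed. */
void dallied_write_sector(long sector, const char *data)
{
    int slot = (int) (sector % NBUF);
    if (buf_cache[slot].valid && buf_cache[slot].dirty && buf_cache[slot].sector != sector)
        disk_write_sector(buf_cache[slot].sector, buf_cache[slot].data);  /* evict: flush old dirty data first */
    buf_cache[slot].sector = sector;
    memcpy(buf_cache[slot].data, data, SECTOR);
    buf_cache[slot].valid = buf_cache[slot].dirty = 1;   /* disk copy is now out of date */
}

/* Later, when the system is idle, actually perform the writes. */
void flush_dallied_writes(void)
{
    for (int i = 0; i < NBUF; i++)
        if (buf_cache[i].valid && buf_cache[i].dirty) {
            disk_write_sector(buf_cache[i].sector, buf_cache[i].data);
            buf_cache[i].dirty = 0;
        }
}

/* Reads of dallied data come straight from the cache, so they are fast too. */
int dallied_read_sector(long sector, char *out)
{
    int slot = (int) (sector % NBUF);
    if (buf_cache[slot].valid && buf_cache[slot].sector == sector) {
        memcpy(out, buf_cache[slot].data, SECTOR);
        return 1;   /* served from memory */
    }
    return 0;       /* caller falls back to a disk read */
}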