CS111 Lecture 11

May 6, 2009

by Jeannie Chen and Robert Phung

File System Design

Flash File System

The flash drive is an electronic storage device with no rotation or moving parts. There are two types of flash memory: NOR and NAND. NOR is the older, more mature, and cheaper type, and it is faster at reading data. NAND is newer, has a higher capacity, and is faster at writing data.

Flash drives can only change 1 bits to 0, never 0 bits to 1: a write stores the AND of the data being written with what is already on the drive. To set bits back to 1, an entire block must be erased, which changes every bit in the block to 1 and is very slow. For NOR drives, the erase operation takes seconds, and a block wears out after about 100,000 erases. For NAND drives, an erase takes milliseconds, and a block wears out after about 1,000,000 erases. NAND has another quirk: after about 1,000 reads, the data on a block can be lost. However, if you do 999 reads on a block and then perform a write, that block is good for another 1,000 reads before the data is lost.
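
To make the write semantics concrete, here is a minimal sketch in C (a model, not any real drive's firmware) of a flash block where a write ANDs the new data into the old contents and an erase sets every bit back to 1:

    #include <string.h>

    #define BLOCK_SIZE 4096

    /* A write can only clear bits: new cell value = old AND data,
       so 1 bits can become 0, but 0 bits can never become 1. */
    void flash_write(unsigned char block[BLOCK_SIZE],
                     const unsigned char *data, int off, int len)
    {
        for (int i = 0; i < len; i++)
            block[off + i] &= data[i];
    }

    /* An erase sets every bit in the block to 1. It is the only way
       to bring a bit back to 1, and it is slow and wears the block. */
    void flash_erase(unsigned char block[BLOCK_SIZE])
    {
        memset(block, 0xFF, BLOCK_SIZE);
    }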

Flash drives will always have bad blocks, which are damaged sectors. The drive still works; the bad blocks are simply marked as unusable and never used.

Flash Drives in the Market

                     Sun 32GB NAND Flash Drive      Sun 1TB 7200RPM SATA Hard Drive
Price                $960                           $660
Random 4KB reads     35,000 IOPS                    80 IOPS (random)
Random 4KB writes    3,300 IOPS                     -
Capacity             -                              ~30x the flash drive
Speed                -                              ~30x slower
Power                2.5W (max), 2.1W (typical),    -
                     0.1W (sleep)
MTBF                 2 million hours                -
                     (24/7 operation for 3 years)

File System in Linux

Flash Translation Layer (FTL): a block manager for the flash array, which includes a garbage collector and implements a wear-leveling algorithm to prevent hammering on the same block. For example, if block 57 was already written, the FTL writes to another physical block instead. This is all done at the hardware level.
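
As a rough illustration of wear leveling (a simplified sketch, not an actual FTL), the layer can keep a logical-to-physical block map plus a per-block erase count, and send each rewrite to the least-worn free block:

    #define NBLOCKS 1024

    static int map[NBLOCKS];         /* logical block -> physical block */
    static int erases[NBLOCKS];      /* erase count per physical block */
    static int is_free[NBLOCKS];     /* 1 if physical block is erased/free */

    /* Pick the free physical block that has been erased the fewest times. */
    static int least_worn_free(void)
    {
        int best = -1;
        for (int p = 0; p < NBLOCKS; p++)
            if (is_free[p] && (best < 0 || erases[p] < erases[best]))
                best = p;
        return best;
    }

    /* Rewriting a logical block (e.g., block 57) lands on a fresh
       physical block; the old one is erased and returned to the pool. */
    void ftl_rewrite(int logical)
    {
        int old = map[logical];
        int fresh = least_worn_free();
        map[logical] = fresh;
        is_free[fresh] = 0;
        erases[old]++;               /* the garbage collector erases it */
        is_free[old] = 1;
    }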

File System:

[x][x][?][?][x][?][x][?][?]
[x] - block is in use
[?] - unused block

Garbage Collector: In a normal file system, the garbage collector leaves the unused [?] blocks alone. In a flash file system, however, the garbage collector erases the unused blocks in the background, so that later writes can go straight to pre-erased blocks.

If the user is paranoid about the garbage collector in the FTL, which is implemented in hardware, there is UBIFS, a flash file system implemented in software, which communicates with the device driver directly. However, it does take up a bit of the CPU.


Performance Improvement Strategies

Batching

Batching is reading x sectors at a time, where x can be any number such as 1, 8, 16, or even 1,000,000. Reading 1,000,000 sectors at a time can give great throughput, but it leads to internal fragmentation and extra latency, because we usually do not need that many sectors. One fix is to read variable-size blocks, using different sizes for different operations, but this adds complexity. Therefore, we usually read 8 or 16 sectors at a time, which avoids the complexity of variable-size blocks while still amortizing the cost of each request.
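
For example, assuming 512-byte sectors, a single 8192-byte read() fetches a 16-sector batch in one request instead of sixteen separate single-sector reads:

    #include <unistd.h>

    char buf[16 * 512];              /* one batch = 16 sectors */

    /* One system call, one disk request, for the whole batch. */
    ssize_t batched_read(int fd)
    {
        return read(fd, buf, sizeof buf);
    }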

READING

Prefetching: The process of guessing which blocks the application will need next and fetching them before the application asks for them. The prefetched blocks are put into the cache. This only works well for sequential reads.
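
On Linux, an application can also hint at its access pattern explicitly. Here is a minimal sketch using posix_fadvise(), which asks the kernel to start reading a byte range into the cache ahead of time:

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <sys/types.h>

    /* Hint: we will read the next 8192-byte block soon, so the kernel
       can fetch it into the page cache while we work on this one. */
    void prefetch_next(int fd, off_t next_off)
    {
        posix_fadvise(fd, next_off, 8192, POSIX_FADV_WILLNEED);
    }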

Speculation: The hardware guessing where the next read will be. This is implemented at the instruction level.

WRITING

Dallying: The operating system waits for more write requests before it starts writing, then writes everything together. Unfortunately, this can lead to data loss: if the power fails while the data to be written is still only in the cache, that data is permanently lost.

In order to prevent data loss, we give the application a way to tell the operating system not to dally by flushing out all cached data onto the disk.

Functions:
sync() – Schedules all cached data (for all files) to be written to disk. This is not instantaneous; it only schedules the writes.
fsync(fd) – Writes all data and metadata cached for fd to the disk immediately, waiting for the writes to complete.
fdatasync(fd) – Commits all data cached for fd onto the disk, but skips metadata that is not needed to read the data back (such as timestamps), saving a disk write.
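
For instance, an application that has just written critical data can force it to disk before continuing. A minimal sketch:

    #include <unistd.h>

    /* Don't let the OS dally: push our cached writes to the disk now. */
    int write_and_flush(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t)len)
            return -1;               /* short or failed write */
        return fsync(fd);            /* blocks until the data is on disk */
    }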

Performance

Consider the following sample code:

	for (int i = 0; i < N; i++)
	{
		read(ifd, buf, 8192);	// read one 8 KB block from drive 1
		compute();		// 1 ms of CPU work
		write(ofd, buf, 8192);	// write the block to drive 2
	}
No batching or prefetching:
13 (read) + 1 (compute) + 14 (write) = 28 ms per block

Prefetching 1 block on read:
fetch the next block while writing the previous one
0 (read) + 1 (compute) + 14 (write) = 15 ms per block

Prefetching and dallying (1 block):
fetch the next block and compute while writing, all at the same time
0 (read) + 0 (compute) + 14 (write) = 14 ms per block
The bottleneck is now copying the data into the write cache.

What if we use batches of 16 blocks instead of 1 block for reading and writing? The seek costs are paid once per batch rather than once per block:
13 (read) + 16 (compute) + 14 (write) = 43 ms per 16 blocks, or about 2.7 ms/block,
roughly a 10x improvement over the unbatched 28 ms/block.

We want to reduce the per-block time to 1 ms, the time it takes for the CPU to compute, since that is the lower bound. This leads to multitasking, where the reads and writes are done in parallel with computation.

Multitasking

Consider T threads, each running the sample code above, with T input drives and T output drives, all running in parallel at full speed. This keeps the CPU busy so that no CPU processing time is wasted.
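
Here is a sketch of that setup with POSIX threads; T, N, ifd[], ofd[], and compute() are assumed from the discussion above, and error handling is omitted:

    #include <pthread.h>
    #include <unistd.h>

    #define T 4                      /* number of thread/drive pairs */

    extern int N;
    extern int ifd[T], ofd[T];       /* one input and one output drive each */
    extern void compute(void);

    static void *pipeline(void *arg)
    {
        int t = *(int *)arg;
        char buf[8192];
        for (int i = 0; i < N; i++) {
            read(ifd[t], buf, sizeof buf);
            compute();
            write(ofd[t], buf, sizeof buf);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[T];
        int id[T];
        for (int t = 0; t < T; t++) {
            id[t] = t;
            pthread_create(&tid[t], NULL, pipeline, &id[t]);
        }
        for (int t = 0; t < T; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }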

Scheduling

The disk is a scarce resource. Scheduling is the process of allocating the disk head to accommodate the input/output requests.

Read/Write Requests

Example: the disk head starts at block 1000, and requests arrive for blocks 500, 3400, 499, and 4500, in that order. Assume the cost of seeking from block i to block j is |i-j|.

Types of scheduling algorithms:
First Come First Serve (FCFS)
Cost = 500 + 2900 + 2901 + 4001 = 10302
In FCFS, there is no starvation.

Shortest Seek Time First (SSTF)
Cost = 500 + 1 + 2901 + 1100 = 4502
In SSTF, starvation can occur: if requests near the head keep arriving, a distant request may wait forever.
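
As a sanity check, this small sketch replays the example (head at block 1000; requests 500, 3400, 499, 4500) and sums the seek costs under both policies:

    #include <stdio.h>
    #include <stdlib.h>

    #define NREQ 4

    int main(void)
    {
        int req[NREQ] = { 500, 3400, 499, 4500 };
        int done[NREQ] = { 0 };
        long head, cost;

        /* FCFS: serve the requests in arrival order. */
        head = 1000; cost = 0;
        for (int i = 0; i < NREQ; i++) {
            cost += labs(head - req[i]);
            head = req[i];
        }
        printf("FCFS cost: %ld\n", cost);

        /* SSTF: always serve the closest outstanding request. */
        head = 1000; cost = 0;
        for (int n = 0; n < NREQ; n++) {
            int best = -1;
            for (int i = 0; i < NREQ; i++)
                if (!done[i] && (best < 0 ||
                    labs(head - req[i]) < labs(head - req[best])))
                    best = i;
            done[best] = 1;
            cost += labs(head - req[best]);
            head = req[best];
        }
        printf("SSTF cost: %ld\n", cost);
        return 0;
    }

Run, it prints 10302 for FCFS and 4502 for SSTF, matching the sums above.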

Elevator Algorithm
The elevator algorithm looks for the shortest seek time in the current direction. It is similar to SSTF, but the head does not change direction until it reaches the end. This algorithm does not lead to starvation, but it is not fair: blocks near the middle of the disk are serviced faster, on average, than blocks near the ends.
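
A sketch of the elevator's selection step (a hypothetical helper, not kernel code): among the outstanding requests, pick the nearest one in the current direction, and reverse only when none remain that way:

    #include <stdlib.h>

    /* blocks[]: positions of outstanding requests; n: how many.
       head: current head position; *dir: +1 moving up, -1 moving down.
       Returns the index of the request to serve next, or -1 if none. */
    int elevator_next(const int blocks[], int n, int head, int *dir)
    {
        for (int pass = 0; pass < 2; pass++) {
            int best = -1;
            for (int i = 0; i < n; i++) {
                if ((blocks[i] - head) * (*dir) < 0)
                    continue;        /* wrong direction: skip for now */
                if (best < 0 ||
                    abs(blocks[i] - head) < abs(blocks[best] - head))
                    best = i;
            }
            if (best >= 0)
                return best;
            *dir = -*dir;            /* reached the end: reverse */
        }
        return -1;                   /* no outstanding requests */
    }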

Circular Elevator Algorithm
The circular elevator algorithm is similar to the elevator algorithm, but the head sweeps in only one direction; upon reaching the end, it seeks all the way back to the beginning and starts over. The long seek back costs some throughput, but the algorithm is fair: every block sees the same average wait.

Anticipatory Scheduling
After performing an input/output operation, wait a teensy bit (e.g. 1 ms) before starting the next I/O operation. Note that anticipatory scheduling is orthogonal to all other scheduling algorithms.

For example, suppose there are two unrelated threads A and B. Thread A wants to write at blocks 3, 4, 5, 6, 7, and 8, while thread B wants to write at blocks 105, 106, 107, 108, and 109. The requests are received in the order 3, 105, 4, 106, 5, 107, 6, 108, 7, 8, 109. Processing them in that order is very inefficient, because the disk head keeps seeking back and forth between distant blocks. Anticipatory scheduling fixes this: after serving A's request for block 3, the scheduler waits about 1 ms, thread A submits its request for block 4, and the head can service 4, 5, 6, 7, and 8 before making one long seek to the 105-109 range.
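
Sketched as scheduler pseudologic (the helper functions here are hypothetical, not a real kernel API), the idea is simply to dally briefly before committing the head:

    /* Hypothetical helpers, assumed for this sketch: */
    int  queue_nonempty(void);
    void wait_for_request(void);
    void dally_ms(int ms);           /* sleep briefly; requests may arrive */
    int  pick_best_request(void);    /* underlying policy: SSTF, elevator... */
    void serve(int block);

    void anticipatory_loop(void)
    {
        for (;;) {
            if (!queue_nonempty()) {
                wait_for_request();
                continue;
            }
            /* Dally ~1 ms: the thread whose I/O just completed may be
               about to ask for the adjacent block, which saves a long
               seek to a distant request. */
            dally_ms(1);
            serve(pick_best_request());
        }
    }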