Lecture 10: File System Performance

CS 111 Winter 2012

Written by Sayan Samanta, Saqib Mohammad, Ju Young Park, Lawrence Huang

Introduction

What is a disk?

Steps to reading a sector:

Example Disk 1 – Seagate XT Barracuda 3 TB hard disk drive:

Example Disk 2 – Corsair Force solid state drive 60 GB:

Hybrid drives:

Reading from a disk

I/O function prototype:

Optimized version 1 (sequential access):

Optimized version 2 (batching):

Optimized version 3 (multitasking):

Optimized version 4 and beyond:

Summary of optimization methods

Introduction

A file system is an abstraction for storing data in a way that provides consistent, reliable behavior at startup and during execution.

During the initial design of UNIX, disks were fast relative to their CPU counterparts. Therefore, the premium was on minimizing the time the CPU spent running file system code. Today the situation is reversed: CPUs dominate disks in terms of speed, so the model has changed to focus on minimizing disk access time.

What is a disk?

A disk is…

  1. A serialization of data into binary sequences
  2. Composed of multiple platters stacked on top of each other, spinning at a manufactured speed
     a. Each platter is arranged into tracks of concentric circles, with the data on each track organized into sectors

A disk has…

  1. A head to read and write data from the surface of every platter
  2. A cache to buffer data transfers between the computer and the disk

Steps to reading a sector:

  1. Seek to the sector:
     a. Move the arm to the appropriate platter so the head can read (if necessary)
     b. Accelerate and decelerate the head so it comes to rest over the correct track
  2. Wait for the sector:
     a. Once aligned, wait for the correct sector to rotate underneath the head before beginning to read
  3. Read the data as it passes under the head -> store it in the hard drive cache
  4. Transfer the data into main memory -> store it in RAM or registers

The overall data path for a typical read is one of the following:

disk -> cache -> RAM -> CPU (the controller copies data into RAM, e.g. via DMA)

disk -> cache -> CPU (the CPU fetches the data itself, i.e. programmed I/O)
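
To make these steps concrete, the standard back-of-the-envelope model prices one random read as seek + rotational latency + transfer time. Below is a minimal C sketch of that model; the decomposition is standard, and the numbers are the ones quoted for the example drive in the next section:

#include <stdio.h>

/* Rough cost model of one random 8 KiB disk read: seek + rotational
   latency + transfer, using the Seagate figures quoted below. */
int main(void) {
    double seek_ms     = 8.5;                 /* average read seek time  */
    double rotation_ms = 0.5 * 60e3 / 7200;   /* half a turn at 7200 rpm */
    double transfer_ms = 8192 / 138e6 * 1e3;  /* 8 KiB at 138 MB/s       */
    printf("seek %.2f + rotate %.2f + transfer %.3f = %.2f ms\n",
           seek_ms, rotation_ms, transfer_ms,
           seek_ms + rotation_ms + transfer_ms);
    return 0;
}

Half a revolution at 7200 rpm is about 4.17 ms, which is exactly where the drive's quoted "average latency" comes from. Note that seek and rotation dominate the actual transfer by two orders of magnitude.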

Example Disk 1 – Seagate XT Barracuda 3 TB hard disk drive:

  1. 3,907,029,160 sectors
  2. 7200 rpm (120 Hz)
  3. 64 MiB cache
  4. 4.16 ms average rotational latency
  5. 8.5 ms average read seek time
  6. 9.5 ms average write seek time
  7. 6 Gb/s interface transfer rate
  8. 138 MB/s max sustained read rate
  9. Power consumption: 9 W operating, 6 W idle
  10. 2.8 A max current at 12 V (starting up the drive requires more power than keeping it running)
  11. Shock handling:
      a. operating threshold: 63 × force of gravity / 2 ms
      b. idle threshold: 300 × force of gravity / 2 ms
  12. 1 unrecoverable read error per 10^14 bits read (see the sketch after this list)
  13. $0.10 per GB
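
The error-rate figure in item 12 is easier to appreciate as an expected value. A quick check in C (this treats the quoted spec as a simple per-bit error rate, which is an approximation):

#include <stdio.h>

/* Expected unrecoverable errors per full sequential read of the drive,
   at the quoted rate of 1 error per 10^14 bits read. */
int main(void) {
    double bits_per_scan  = 3e12 * 8;   /* 3 TB drive, in bits */
    double errors_per_bit = 1e-14;
    printf("expected errors per full scan: %.2f\n",
           bits_per_scan * errors_per_bit);   /* about 0.24 */
    return 0;
}

In other words, reading the whole drive roughly four times yields one expected unrecoverable error.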

Example Disk 2 – Corsair Force solid state drive 60 GB:

  1. Based on NAND flash, so there are no moving parts, meaning no inherent latency spent waiting for data access to become "ready" (i.e. no seek time or rotational latency)
  2. 280 MB/s max read rate
  3. 270 MB/s max write rate
  4. Power consumption: 2 W operating, 0.5 W idle
  5. Shock handling: 1000 × force of gravity / 2 ms (another benefit of not having moving parts)
  6. Mean time between failures: 1 million hours
     a. Flash cells survive only a finite number of write/erase cycles before they "die"
  7. $2.40 per GB - significantly more than the HDD counterpart
  8. Fine print:
     a. 50 kIOPS is the main bottleneck
     b. e.g. with 4 KiB random, aligned writes, performance is really only ~200 MB/s (see the check after this list)
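
That fine-print figure can be verified directly: at 50,000 operations per second and 4 KiB per operation, the IOPS ceiling rather than the interface limits throughput. A one-line check:

#include <stdio.h>

/* Effective throughput when the 50 kIOPS ceiling is the bottleneck. */
int main(void) {
    double iops         = 50e3;   /* operations per second        */
    double bytes_per_op = 4096;   /* 4 KiB random, aligned writes */
    printf("%.0f MB/s\n", iops * bytes_per_op / 1e6);   /* ~205 MB/s */
    return 0;
}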

Hybrid drives:

  1. A drive that combines the classic hard disk with an SSD component
  2. The operating system sees 1 drive; all allocation is handled internally by an adaptive memory algorithm and the disk controller
  3. On a basic level – a disk with a much larger, persistent data cache, where the most-used data stays in the cache to accelerate the system (a toy sketch of such a policy follows this list):
     a. Reading from the SSD will be very fast
     b. Reading from the HDD will follow the previously described protocol
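
The notes do not specify the drive's actual adaptive algorithm (it lives in the controller firmware), but the placement idea in item 3 can be sketched with a toy promotion policy; every name and threshold below is hypothetical:

#include <stdbool.h>
#include <stdio.h>

#define SECTORS 1024          /* toy-sized drive */
#define HOT_THRESHOLD 4       /* promote after this many accesses */

static int  access_count[SECTORS];
static bool in_ssd[SECTORS];

/* Serve a read from flash if the sector is "hot", else from the
   platters, counting accesses so hot sectors migrate to the SSD. */
static void read_sector(unsigned s) {
    if (in_ssd[s]) {
        /* fast path: flash read, no seek or rotational latency */
    } else {
        /* slow path: platter read (seek + rotation + transfer) */
        if (++access_count[s] >= HOT_THRESHOLD)
            in_ssd[s] = true;          /* promote frequently used data */
    }
}

int main(void) {
    for (int i = 0; i < 8; i++)
        read_sector(7);                /* hammer one sector */
    printf("sector 7 in SSD: %d\n", in_ssd[7]);   /* prints 1 */
    return 0;
}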

Reading from a disk

I/O function prototype:

1     char buf[8192];
2     for (;;) {
3       lseek(in_fd, in_offset, SEEK_SET);
4       n = read(in_fd, buf, sizeof(buf));
5
6       n1 = compute(buf, n);
7       lseek(out_fd, out_offset, SEEK_SET);
8       write(out_fd, buf, n1);
9     }


Let’s analyze the time spent in each line of code:

  1. line 3: on average, it takes 8.5 ms for the read head to seek to the correct track.
  2. line 4: 4.16 ms for the correct sector to rotate under the head (plus about 10 microseconds for the actual read).
  3. line 6: the time it takes to process the data; assume this program takes 10 ms.
  4. line 7: same as line 3, but seeking for a write takes about 1 ms longer (9.5 ms), since the head must settle on the track more precisely.
  5. line 8: same as line 4, but writing to disk.

Time for each loop iteration:

   8.32 ms      rotational latency (2 × 4.16 ms: one read, one write)
  18 ms         seek time (8.5 ms read + 9.5 ms write)
   0.04404 ms   syscall overhead
  10 ms         compute time
-------------------------------------------------
= 36.36404 ms / iteration

Each iteration covers 8192 bytes, so performance is 8192 B / 0.03636 s ≈ 225 kBps.
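
For reference, here is a self-contained version of the prototype loop that actually compiles; the file names, the offset-update policy, and the stub compute() are assumptions, since the notes only show the fragment:

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Stub standing in for the unspecified computation in the notes. */
static size_t compute(char *buf, size_t n) { (void) buf; return n; }

int main(void) {
    char buf[8192];
    int in_fd  = open("infile",  O_RDONLY);   /* hypothetical file names */
    int out_fd = open("outfile", O_WRONLY | O_CREAT, 0666);
    off_t in_offset = 0, out_offset = 0;
    if (in_fd < 0 || out_fd < 0)
        return EXIT_FAILURE;

    for (;;) {
        lseek(in_fd, in_offset, SEEK_SET);
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n <= 0)
            break;                     /* EOF or error ends the loop */
        size_t n1 = compute(buf, n);
        lseek(out_fd, out_offset, SEEK_SET);
        write(out_fd, buf, n1);
        in_offset  += n;               /* one possible offset policy */
        out_offset += n1;
    }
    return EXIT_SUCCESS;
}

If the offsets simply advance past the data just transferred, as here, the lseek calls are redundant, which is exactly what the first optimization below exploits.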


This is not a very efficient way to do file I/O, and we propose some basic optimization ideas:

  1. Sequential access, instead of random. This minimizes seek time.
  2. Batching. Instead of 8192-byte buffers, use 1024*1024-byte buffers, for example.
  3. Multitasking. Schedule something else instead of waiting for the hardware.
  4. Direct memory access (DMA). Let the disk controller copy data into RAM itself instead of using programmed I/O, shaving off a bit more CPU time.
  5. DMA without context switching. Just 1 process, polling for completion of asynchronous I/O.
  6. Caching. The file system caches the most recently read sectors.
  7. Pre-fetching. Have the file system read a few extra sectors ahead of the current request.
  8. Speculation. A more general form of pre-fetching and dallying.
  9. Dallying. Write to the cache instead of directly to disk, and flush later.

Optimized version 1 (sequential access):

Remove all lseek calls by accessing the drive sequentially:

for (;;) {
  n = read(in_fd, buf, sizeof(buf));
  n1 = compute(buf, n);
  write(out_fd, buf, n1);
}

With sequential access there are no more random seeks; the only remaining seek cost is the occasional track-to-track move when an access crosses a track boundary. Suppose the hardware's track-to-track seek time is 1.8 ms and each track holds roughly 1 MB of data; amortized over 8 KiB accesses, that is 1.8 ms × (8 KiB / 1 MB) ≈ 0.015 ms per access, or about 0.03 ms per iteration for the read and the write together. This implementation also makes 2 fewer system calls per iteration, cutting syscall overhead in half (0.02202 ms).

Loop iteration time after implementing sequential access:

   8.32 ms      latency
   0.03 ms      seek time (down from 18 ms)
   0.02202 ms   syscall overhead
  10 ms         compute time
-------------------------------------------------
= 18.372 ms / iteration on average, about half of what the prototype needed.

We now have a performance of about 450 kBps!

Optimized version 2 (batching):

Use 1 MB buffers instead of 8 KiB:

char buf[1024*1024];

Loop iteration time:

     8.32 ms      latency
     0.03 ms      seek time
     0.02202 ms   syscall overhead
  1280 ms         compute time (128 × 10 ms, since the buffer holds 128× as much data)
-------------------------------------------------
= 1288.37202 ms / iteration, but we are now processing 1 MB instead of 8 KiB per iteration.

Therefore, we now have about 795 kBps.

Why not use even larger buffers, say 1 GB?

  1. L1/L2 cache overflow: once the buffer no longer fits in the CPU caches, memory access latency can grow by a factor of 100 or more.
  2. Awkward track management, especially when a single buffer spans many tracks.
  3. The law of diminishing returns: the per-iteration overhead is already tiny, so going to 1 GB won't save much more time.

Optimized version 3 (multitasking):

Let the CPU do other computations while waiting for disk I/O (still assuming 8 KiB buffers).


Since each read/write system call stalls for a few milliseconds while the disk seeks and rotates, we can schedule another process and perform useful computation during the wait. Under this implementation, the seek delay and rotational latency no longer cost us anything, because the CPU is never idle. However, this does introduce an extra context switching overhead of, say, 0.4 ms.

Total time for each loop iteration:

   0.02202 ms   syscall overhead
  10 ms         compute time
   0.4 ms       context switching overhead
-------------------------------------------------
= 10.42202 ms per 8 KiB iteration, so we now have about 800 kBps of throughput.
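
The notes describe overlapping computation with I/O by scheduling another process; a single-process variant of the same idea (optimization 5 in the list above) can be sketched with POSIX asynchronous I/O and two buffers. This is a minimal sketch, reusing the hypothetical compute() stub; on Linux it links with -lrt:

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ 8192

/* Stub standing in for the notes' compute(); returns bytes "produced". */
static size_t compute(char *buf, size_t n) { (void) buf; return n; }

int main(void) {
    static char buf[2][BUFSZ];
    int fd = open("infile", O_RDONLY);   /* hypothetical file name */
    if (fd < 0)
        return 1;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf[0];
    cb.aio_nbytes = BUFSZ;
    cb.aio_offset = 0;
    aio_read(&cb);                       /* start reading the first block */

    for (int cur = 0;; cur ^= 1) {
        while (aio_error(&cb) == EINPROGRESS)
            ;                            /* poll; real code would do work here */
        ssize_t n = aio_return(&cb);
        if (n <= 0)
            break;                       /* EOF or error */

        /* Start the next read into the other buffer, then compute on this
           one while the disk fills that buffer in the background. */
        off_t next = cb.aio_offset + n;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[cur ^ 1];
        cb.aio_nbytes = BUFSZ;
        cb.aio_offset = next;
        aio_read(&cb);

        compute(buf[cur], n);            /* overlaps with the pending read */
    }
    close(fd);
    return 0;
}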


Optimized version 4 and beyond:


We can use better caching or speculation methods, which depend heavily on the hardware and on the application being run. However, since the unavoidable data computation now takes up almost 96% of the total iteration time (10 ms out of 10.42 ms), further optimizations cannot introduce dramatic performance improvements.
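
As one concrete instance of pre-fetching, POSIX lets an application hint its access pattern so the file system reads ahead more aggressively. A minimal sketch (posix_fadvise is a standard call; whether it actually helps depends on the kernel and the workload, as noted above):

#include <fcntl.h>

int main(void) {
    int fd = open("infile", O_RDONLY);   /* hypothetical file name */
    if (fd < 0)
        return 1;

    /* Tell the kernel we will read the whole file sequentially, so it
       can prefetch upcoming blocks into the file system cache early. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ... the read loop from the earlier versions goes here ... */
    return 0;
}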

Summary of optimization methods

Optimization method        Original throughput    Optimized throughput    Speedup
----------------------------------------------------------------------------------
Original (unmodified)      225 kBps               N/A                     N/A
Sequential access          225 kBps               450 kBps                200%
Batching (1 MB buffers)    450 kBps               795 kBps                177%
Multitasking               450 kBps               800 kBps                178%
Others                     800 kBps               N/A                     N/A (low)