Lecture 10: File System Performance

CS 111 Winter 2012

Written by Sayan Samanta, Saqib Mohammad, Ju Young Park, Lawrence Huang

Introduction

What is a disk?

Steps to reading a sector:

Example Disk 1 – Seagate XT Barracuda 3 TB hard disk drive:

Example Disk 2 – Corsair Force solid state drive 60 GB:

Hybrid drives:

Reading from a disk

I/O function prototype:

Optimized version 1 (sequential access):

Optimized version 2 (batching):

Optimized version 3 (multitasking):

Optimized version 4 and beyond:

Summary of optimization methods

Introduction

A file system is an abstraction for storing data in a way that provides consistent, reliable behavior at startup and during execution.

During the initial design of UNIX, disks were fast relative to their CPU counterparts. Therefore, the premium was on minimizing the time the CPU spent running file system code. Today the situation is reversed: CPUs dominate disks in terms of speed, so the model has changed to focus on minimizing disk access time.

What is a disk?

A disk is…

  1. A serialization of data into binary sequences
  2. Composed of multiple platters stacked on top of each other, spinning at a manufactured speed
     a. Each platter is arranged into tracks of concentric circles, with the data on each track organized into sectors

A disk has…

  1. A head to read and write data from the surface of every platter
  2. A cache to buffer data transfers between the computer and the disk

Steps to reading a sector:

  1. Seek to the sector:
     a. Move the arm to the appropriate platter so the head can read (if necessary)
     b. Accelerate and decelerate the head so it comes to rest over the correct track
  2. Wait for the sector:
     a. Once aligned, wait for the correct sector to rotate underneath the head before beginning to read
  3. Read the data as it passes under the head -> store it in the hard drive cache
  4. Transfer the data into main memory -> store it in RAM or registers

The overall data path for a typical read is one of the following:

disk -> cache -> RAM -> CPU (the controller copies data into RAM, e.g. via DMA)

disk -> cache -> CPU (the CPU fetches the data itself, i.e. programmed I/O)
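
To make these steps concrete, the standard back-of-the-envelope model prices one random read as seek + rotational latency + transfer time. Below is a minimal C sketch of that model; the decomposition is standard, and the numbers are the ones quoted for the example drive in the next section:

#include <stdio.h>

/* Rough cost model of one random 8 KiB disk read: seek + rotational
   latency + transfer, using the Seagate figures quoted below. */
int main(void) {
    double seek_ms     = 8.5;                 /* average read seek time  */
    double rotation_ms = 0.5 * 60e3 / 7200;   /* half a turn at 7200 rpm */
    double transfer_ms = 8192 / 138e6 * 1e3;  /* 8 KiB at 138 MB/s       */
    printf("seek %.2f + rotate %.2f + transfer %.3f = %.2f ms\n",
           seek_ms, rotation_ms, transfer_ms,
           seek_ms + rotation_ms + transfer_ms);
    return 0;
}

Half a revolution at 7200 rpm is about 4.17 ms, which is exactly where the drive's quoted "average latency" comes from. Note that seek and rotation dominate the actual transfer by two orders of magnitude.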

Example Disk 1 – Seagate XT Barracuda 3 TB hard disk drive:

  1. 3,907,029,160 sectors
  2. 7200 rpm (120 Hz)
  3. 64 MiB cache
  4. 4.16 ms average rotational latency
  5. 8.5 ms average read seek time
  6. 9.5 ms average write seek time
  7. 6 Gb/s interface transfer rate
  8. 138 MB/s max sustained read rate
  9. Power consumption: 9 W operating, 6 W idle
  10. 2.8 A max current at 12 V (starting up the drive requires more power than keeping it running)
  11. Shock handling:
      a. operating threshold: 63 × force of gravity / 2 ms
      b. idle threshold: 300 × force of gravity / 2 ms
  12. 1 unrecoverable read error per 10^14 bits read (see the sketch after this list)
  13. $0.10 per GB
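
The error-rate figure in item 12 is easier to appreciate as an expected value. A quick check in C (this treats the quoted spec as a simple per-bit error rate, which is an approximation):

#include <stdio.h>

/* Expected unrecoverable errors per full sequential read of the drive,
   at the quoted rate of 1 error per 10^14 bits read. */
int main(void) {
    double bits_per_scan  = 3e12 * 8;   /* 3 TB drive, in bits */
    double errors_per_bit = 1e-14;
    printf("expected errors per full scan: %.2f\n",
           bits_per_scan * errors_per_bit);   /* about 0.24 */
    return 0;
}

In other words, reading the whole drive roughly four times yields one expected unrecoverable error.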

Example Disk 2 – Corsair Force solid state drive 60 GB:

  1. Based on NAND flash, so there are no moving parts, meaning no inherent latency spent waiting for data access to become "ready" (i.e. no seek time or rotational latency)
  2. 280 MB/s max read rate
  3. 270 MB/s max write rate
  4. Power consumption: 2 W operating, 0.5 W idle
  5. Shock handling: 1000 × force of gravity / 2 ms (another benefit of not having moving parts)
  6. Mean time between failures: 1 million hours
     a. Flash cells survive only a finite number of write/erase cycles before they "die"
  7. $2.40 per GB - significantly more than the HDD counterpart
  8. Fine print:
     a. 50 kIOPS is the main bottleneck
     b. e.g. with 4 KiB random, aligned writes, performance is really only ~200 MB/s (see the check after this list)
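
That fine-print figure can be verified directly: at 50,000 operations per second and 4 KiB per operation, the IOPS ceiling rather than the interface limits throughput. A one-line check:

#include <stdio.h>

/* Effective throughput when the 50 kIOPS ceiling is the bottleneck. */
int main(void) {
    double iops         = 50e3;   /* operations per second        */
    double bytes_per_op = 4096;   /* 4 KiB random, aligned writes */
    printf("%.0f MB/s\n", iops * bytes_per_op / 1e6);   /* ~205 MB/s */
    return 0;
}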

Hybrid drives:

  1. A drive that combines the classic hard disk with an SSD component
  2. The operating system sees 1 drive; all allocation is handled internally by an adaptive memory algorithm and the disk controller
  3. On a basic level – a disk with a much larger, persistent data cache, where the most-used data stays in the cache to accelerate the system (a toy sketch of such a policy follows this list):
     a. Reading from the SSD will be very fast
     b. Reading from the HDD will follow the previously described protocol
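
The notes do not specify the drive's actual adaptive algorithm (it lives in the controller firmware), but the placement idea in item 3 can be sketched with a toy promotion policy; every name and threshold below is hypothetical:

#include <stdbool.h>
#include <stdio.h>

#define SECTORS 1024          /* toy-sized drive */
#define HOT_THRESHOLD 4       /* promote after this many accesses */

static int  access_count[SECTORS];
static bool in_ssd[SECTORS];

/* Serve a read from flash if the sector is "hot", else from the
   platters, counting accesses so hot sectors migrate to the SSD. */
static void read_sector(unsigned s) {
    if (in_ssd[s]) {
        /* fast path: flash read, no seek or rotational latency */
    } else {
        /* slow path: platter read (seek + rotation + transfer) */
        if (++access_count[s] >= HOT_THRESHOLD)
            in_ssd[s] = true;          /* promote frequently used data */
    }
}

int main(void) {
    for (int i = 0; i < 8; i++)
        read_sector(7);                /* hammer one sector */
    printf("sector 7 in SSD: %d\n", in_ssd[7]);   /* prints 1 */
    return 0;
}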

Reading from a disk

I/O function prototype:

1     char buf[8192];
2     for (;;) {
3       lseek(in_fd, in_offset, SEEK_SET);
4       n = read(in_fd, buf, sizeof(buf));
5
6       n1 = compute(buf, n);
7       lseek(out_fd, out_offset, SEEK_SET);
8       write(out_fd, buf, n1);
9     }


Let’s analyze the time spent in each line of code:

  1. line 3: on average, it takes 8.5 ms for the read head to seek to the correct track.
  2. line 4: 4.16 ms for the correct sector to rotate under the head (plus about 10 microseconds for the actual read).
  3. line 6: the time it takes to process the data; assume this program takes 10 ms.
  4. line 7: same as line 3, but seeking for a write takes about 1 ms longer (9.5 ms), since the head must settle on the track more precisely.
  5. line 8: same as line 4, but writing to disk.

Time for each loop iteration:

   8.32 ms      rotational latency (2 × 4.16 ms: one read, one write)
  18 ms         seek time (8.5 ms read + 9.5 ms write)
   0.04404 ms   syscall overhead
  10 ms         compute time
-------------------------------------------------
= 36.36404 ms / iteration

Each iteration covers 8192 bytes, so performance is 8192 B / 0.03636 s ≈ 225 kBps.
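
For reference, here is a self-contained version of the prototype loop that actually compiles; the file names, the offset-update policy, and the stub compute() are assumptions, since the notes only show the fragment:

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Stub standing in for the unspecified computation in the notes. */
static size_t compute(char *buf, size_t n) { (void) buf; return n; }

int main(void) {
    char buf[8192];
    int in_fd  = open("infile",  O_RDONLY);   /* hypothetical file names */
    int out_fd = open("outfile", O_WRONLY | O_CREAT, 0666);
    off_t in_offset = 0, out_offset = 0;
    if (in_fd < 0 || out_fd < 0)
        return EXIT_FAILURE;

    for (;;) {
        lseek(in_fd, in_offset, SEEK_SET);
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n <= 0)
            break;                     /* EOF or error ends the loop */
        size_t n1 = compute(buf, n);
        lseek(out_fd, out_offset, SEEK_SET);
        write(out_fd, buf, n1);
        in_offset  += n;               /* one possible offset policy */
        out_offset += n1;
    }
    return EXIT_SUCCESS;
}

If the offsets simply advance past the data just transferred, as here, the lseek calls are redundant, which is exactly what the first optimization below exploits.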


This is not a very efficient way to do file I/O, and we propose some basic optimization ideas:

  1. Sequential access, instead of random. This minimizes seek time.
  2. Batching. Instead of 8192-byte buffers, use 1024*1024-byte buffers, for example.
  3. Multitasking. Schedule something else instead of waiting for the hardware.
  4. Direct memory access (DMA). Let the disk controller copy data into RAM itself instead of using programmed I/O, shaving off a bit more CPU time.
  5. DMA without context switching. Just 1 process, polling for completion of asynchronous I/O.
  6. Caching. The file system caches the most recently read sectors.
  7. Pre-fetching. Have the file system read a few extra sectors ahead of the current request.
  8. Speculation. A more general form of pre-fetching and dallying.
  9. Dallying. Write to the cache instead of directly to disk, and flush later.

Optimized version 1 (sequential access):

Remove all lseek calls by accessing the drive sequentially:

for (;;) {
  n = read(in_fd, buf, sizeof(buf));
  n1 = compute(buf, n);
  write(out_fd, buf, n1);
}

With sequential access there are no more random seeks; the only remaining seek cost is the occasional track-to-track move when an access crosses a track boundary. Suppose the hardware's track-to-track seek time is 1.8 ms and each track holds roughly 1 MB of data; amortized over 8 KiB accesses, that is 1.8 ms × (8 KiB / 1 MB) ≈ 0.015 ms per access, or about 0.03 ms per iteration for the read and the write together. This implementation also makes 2 fewer system calls per iteration, cutting syscall overhead in half (0.02202 ms).

Loop iteration time after implementing sequential access:

   8.32 ms      latency
   0.03 ms      seek time (down from 18 ms)
   0.02202 ms   syscall overhead
  10 ms         compute time
-------------------------------------------------
= 18.372 ms / iteration on average, about half of what the prototype needed.

We now have a performance of about 450 kBps!

Optimized version 2 (batching):

Use 1 MB buffers instead of 8 KiB:

char buf[1024*1024];

Loop iteration time:

     8.32 ms      latency
     0.03 ms      seek time
     0.02202 ms   syscall overhead
  1280 ms         compute time (128 × 10 ms, since the buffer holds 128× as much data)
-------------------------------------------------
= 1288.37202 ms / iteration, but we are now processing 1 MB instead of 8 KiB per iteration.

Therefore, we now have about 795 kBps.

Why not use even larger buffers, say 1 GB?

  1. L1/L2 cache overflow: once the buffer no longer fits in the CPU caches, memory access latency can grow by a factor of 100 or more.
  2. Awkward track management, especially when a single buffer spans many tracks.
  3. The law of diminishing returns: the per-iteration overhead is already tiny, so going to 1 GB won't save much more time.

Optimized version 3 (multitasking):

Let the CPU do other computations while waiting for disk I/O (still assuming 8 KiB buffers).


Since each read/write system call stalls for a few milliseconds while the disk seeks and rotates, we can schedule another process and perform useful computation during the wait. Under this implementation, the seek delay and rotational latency no longer cost us anything, because the CPU is never idle. However, this does introduce an extra context switching overhead of, say, 0.4 ms.

Total time for each loop iteration:

   0.02202 ms   syscall overhead
  10 ms         compute time
   0.4 ms       context switching overhead
-------------------------------------------------
= 10.42202 ms per 8 KiB iteration, so we now have about 800 kBps of throughput.
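
The notes describe overlapping computation with I/O by scheduling another process; a single-process variant of the same idea (optimization 5 in the list above) can be sketched with POSIX asynchronous I/O and two buffers. This is a minimal sketch, reusing the hypothetical compute() stub; on Linux it links with -lrt:

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ 8192

/* Stub standing in for the notes' compute(); returns bytes "produced". */
static size_t compute(char *buf, size_t n) { (void) buf; return n; }

int main(void) {
    static char buf[2][BUFSZ];
    int fd = open("infile", O_RDONLY);   /* hypothetical file name */
    if (fd < 0)
        return 1;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf[0];
    cb.aio_nbytes = BUFSZ;
    cb.aio_offset = 0;
    aio_read(&cb);                       /* start reading the first block */

    for (int cur = 0;; cur ^= 1) {
        while (aio_error(&cb) == EINPROGRESS)
            ;                            /* poll; real code would do work here */
        ssize_t n = aio_return(&cb);
        if (n <= 0)
            break;                       /* EOF or error */

        /* Start the next read into the other buffer, then compute on this
           one while the disk fills that buffer in the background. */
        off_t next = cb.aio_offset + n;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[cur ^ 1];
        cb.aio_nbytes = BUFSZ;
        cb.aio_offset = next;
        aio_read(&cb);

        compute(buf[cur], n);            /* overlaps with the pending read */
    }
    close(fd);
    return 0;
}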


Optimized version 4 and beyond:


We can use better caching or speculation methods, which depend heavily on the hardware and on the application being run. However, since the unavoidable data computation now takes up almost 96% of the total iteration time (10 ms out of 10.42 ms), further optimizations cannot introduce dramatic performance improvements.
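
As one concrete instance of pre-fetching, POSIX lets an application hint its access pattern so the file system reads ahead more aggressively. A minimal sketch (posix_fadvise is a standard call; whether it actually helps depends on the kernel and the workload, as noted above):

#include <fcntl.h>

int main(void) {
    int fd = open("infile", O_RDONLY);   /* hypothetical file name */
    if (fd < 0)
        return 1;

    /* Tell the kernel we will read the whole file sequentially, so it
       can prefetch upcoming blocks into the file system cache early. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ... the read loop from the earlier versions goes here ... */
    return 0;
}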

Summary of optimization methods

Optimization method        Original throughput    Optimized throughput    Speedup
----------------------------------------------------------------------------------
Original (unmodified)      225 kBps               N/A                     N/A
Sequential access          225 kBps               450 kBps                200%
Batching (1 MB buffers)    450 kBps               795 kBps                177%
Multitasking               450 kBps               800 kBps                178%
Others                     800 kBps               N/A                     N/A (low)