File System performance

CS111 scribe notes Fall 2014
By: Daniel Yao

Basics About File Systems:

File systems need: Disk space, addresses, organization, performance, & reliability.

Example of how files are stored:
For a system of 120PB = 120,000TB. 200,000 hard drives (each 600GB)
         example 1
         example 2

GPFS - General Parallel File System Properties:

        --------------------------
  *Stripes| 1 | 2 | 3 | 1 | 2 | ...
        --------------------------
*Distributed Meta-data (No Central Catalog)
*Efficient Directory Indexing
  Binary Tree or 2-3 Trees
*Distributed Locking (Harder to implement)
*Partition awareness
*File system stays alive during maintenance
example 4

Components important for a file system:


Component Speed Price per unit
CPU Registers Fastest Very Costly
CPU Cache L1 Fast Costly
CPU Cache L2 Less Fast Costly
CPU Cache L3 Slow Costly
RAM Slower Less Costly
Flash Even Slower Cheap
Hard Disk Much Slower Cheaper
Backup Discs/Tape Slowest Cheapest

Maximize bang for buck:


Two ways to measure performance:

*Throughput (bytes/sec) (Read/Write)

*Latency (Turnaround time for 1 request)


The steps to read from a disk

example 3
A Disk typically spins at 5400-15,000 RPM

Send command to controller
Accelerate Platters
Decelerate Platters
Wait until rotation occurs
Get data into disk controller's cache
Copy into RAM or CPU

Comparison between A traditional hard drive and a SSD

Seagate Barracuda ES2 1TB Corsair Force GT (SSD) Description
16 MB N/A Cache - amount of memory the disk drive has
7200 rpm = 120Hz Does not spin Rotation speed - how fast the disk drive spins
8.333 ms N/A Rotation time - worst rotational latency
4,166 ms N/A Average rotational latency
8.5 ms N/A Average read seek
9.5 ms N/A Average write seek - Longer than read seek time because it requires precise positioning
0.8 ms/1.0ms N/A Track-to-track seek - Moving between tracks
1.29 Gb/s N/A Maximum internal transfer rate - Rate at which data comes off the disk and onto the cache.
3 Gb/s 3 Gb/s External transfer rate - Rate at which you transfer data from the disk controller cache to the bus
12.5 W 2.0W Typical running wattage
9 W 2.0W Idle- Average wattage
1.2 Million hours 1.0 Million hours The estimated total lifespan of the drive
0.73% AFR N/A annualized failure rate - Estimated value, yearly failure rate if used 24/7
Non-recoverable read failure rate of 10^-15 percent N/A Probability that the disk will lose the sector
A couple of Gs 1000G Shock resistance
105 MB/s Read / 95 MB/s Write 280 MB/s Read / 270 MB/s Write Transfer Rates


Performance metrics example

~Consider this example device~
1 ns cycle time for CPU
1 μs programmed I/O or 1000 CPU cycles
5 μs to send command 5 PIOS
50 μs device latency
40 μs to read 40 bytes from device (40 PIOS)
5 μs for CPU to process 40 bytes, 5000 cycles

***Again, things we care about:
Throughput - bytes/sec
Latency - delay from request response
Utilization - % of CPU devoted to useful work

for(;;) {
char buf[40];
//*Read 40 bytes from device to buf
compute(buf);
}


Latency = *Send command to controller(5 μs) + device latency(50 μs) + read data (40 μs)
+ compute (5 μs) = 100 μs
Throughput = 1/Latency = 10,000 requests/s
Utilization = 5%

Batching to improve throughput and utilization
for(;;) {
char buf[840];
//Read 840 bytes
compute(buf);
}

By reading more bytes...
Latency = 5 μs + 50 μs + 840 μs + 105 μs = 1,000 μs
Throughput = 21 requests/1ms = 21,000 requests/sec
Utilization = 105/1000 = 10.5%
...We get better overall throughput and CPU utilization.

Multitasking to help improve throughput and utilization

**We are going to let someone else run while we wait for disk
for(;;) {
//Send command to controller
do { Block until interrupt
handle interrupt } while(!read)
//read buffer(40 bytes)
compute(40);
}

By multitasking...
Latency = 5 μs + 50 μs + 5 μs (handle time) + 1 μs (check ready) + 40 μs + 5 μs= 106 μs
Throughput = 1/56 μs = 17,857 requests/sec
Utilization = 5/56 = 8.9%
...We get better overall throughput and CPU utilization.
*Note: We can combine Batching and Multitasking to improve performance further

DMA to improve performance (Tell device where to put data)
Latency = block + handle + check ready + compute = 50 μs + 5 μs + 1 μs + 5 μs = 61 μs
Throughput = 1/11 μs = 91,000 requests/sec
Utilization = 5/11 = 45%

To Speed up even more:
for(;;) { //DMA+Polling
while(DMA Status' Busy)
yield();
}
}

Latency = 50 μs(yielding) + 1 μs(Check DMA) + 5μs(Compute) = 56 μs
Throughput = 1/6 = 166,667 Requests/Sec
Utilization = 5/6 = 84%


Summary of all methods

Method Latency(μs) Throughput(kb/s) Utilization(%)
Polling 100 10 5
Batching 1,000 21 10.5
Interrupt 106 18 8.9
DMA 61 91 45
DMA+Polling 56 167 84
***Note:DMA+Polling looks so good because the work flow is assumed to be very regular. For a general O.S. this approach would not work.