Scribes: Oliver Huang, Jonathan Chu
Things we need from file systems:
Example file system:
Let's have a file system that is:
How can we make a file system for this?
Idea 1: Have disk controllers, each maintaining a certain number of drives, e.g. 16 drives per controller.
Problem: too slow.
Idea 2 (better idea):
Parallelize the system. This approach works better for us, and is the approach used in the General Parallel File System.
GPFS is a good model for how to implement large file systems. The following are performance techniques used in such a system.
Striping - Involves splitting data so that one file is not stored entirely on one drive; instead the file's data is broken up and stored across multiple drives. This allows a file to be read in parallel from a large number of drives, which speeds up access time. Normally, the number of disks is a power of 2 (see the sketch after this list).
Distributed Metadata: Metadata includes information about a file, and not the actual contents of the file. This includes locations, ownership, timestamps etc., much like a catalog entry. The idea of distributed metadata is that there is no central catalog; it is not all stored in one place. Instead, the metadata itself is striped.
Efficient Directory Indexing - If a directory contains a large number of files, say over 1 million, we need an efficient data structure to store all of their entries. We have multiple ideas for this:
Distributed Locking - The file system needs to be able to take locks on a file, even if the file is broken up (distributed) across drives. This is a much harder problem.
Partition Awareness - Say the file system gets partitioned into two sections; this could happen, for example, after a network failure. A user in one section should still be able to get at files that live in the other, separated partition. One solution would be to store copies of changes or reads meant for the other partition; however, this introduces the problem of competing updates after the partitions are reconnected. The solution to this problem is majority rules: only users in the larger partition can make changes, and the smaller side is read-only.
File system stays live during maintenance - It is important that the file system stays available as much as possible. One such time is during maintenance: we want users to be able to keep using the file system while maintenance is going on.
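To make the striping idea concrete, here is a minimal sketch of how a logical file block could be mapped onto a stripe of disks. The disk count, block size, and helper names are assumptions of this sketch, not from GPFS.

#include <stdint.h>

#define NUM_DISKS 16u   /* power of 2, as noted above (assumed for this sketch) */

struct stripe_location {
    uint32_t disk;      /* which disk holds the block     */
    uint64_t offset;    /* block offset within that disk  */
};

/* Round-robin striping: consecutive logical blocks land on consecutive
 * disks, so a large sequential read can touch all disks in parallel. */
static struct stripe_location locate_block(uint64_t logical_block)
{
    struct stripe_location loc;
    loc.disk   = (uint32_t)(logical_block & (NUM_DISKS - 1));  /* block % NUM_DISKS */
    loc.offset = logical_block / NUM_DISKS;
    return loc;
}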
GPFS is a big file system though. Let's think about how to build this from the bottom up. What hardware is important for file system components?
We want to keep these things in mind while building file systems. The file system itself is a data structure that lives at every level of this hierarchy; it is a sort of hybrid storage. We always want to measure price/performance, so the question becomes: how can we measure performance?
There are three main metrics for file system performance: latency (the time from issuing a request to its completion), throughput (the number of requests completed per unit time), and utilization (the fraction of time the CPU spends doing useful work).
Like all systems, there are always tradeoffs between metrics. The top two performance metrics, throughput and latency, are competing objectives; efforts to increase the throughput tend to increase latency, and efforts to decrease the latency tend to decrease the throughput. Our objective, then, is to find a sweet spot between throughput and latency.
When we want to access a file on a disk, these are the things that we must do:
To get a performance measure, we need to know how long this takes. Let's take a look at some example disk drives:
Seagate Barracuda ES2 1TB "near line"
Measuring the performance of the Seagate Barracuda (external transfer rate, maximum internal and maximum sustained transfer rates, power) is much more straightforward than measuring its reliability (MTBF, AFR, nonrecoverable sector read failure rate). The failure rates are so low that it is not economically feasible for manufacturers to wait for failures to happen and record the times, so the reliability numbers are often guesswork.
Corsair Force GT
The real performance advantage of SSD drives is that there is no seek time.
Let our example device have the following timings:
Recall the performance metrics for I/O
These are the numbers we want to define.
We want to check performance for the simplest case, where there is one read operation.
for (;;) {
    char buf[40];
    send command to controller           // 5 μs
    (device latency)                     // 50 μs
    read 40 bytes from device to buf     // read data: 40 μs
    compute(buf)                         // compute: 5 μs
}
total latency: 100μs
To calculate the throughput, we divide 1 by the total latency of one iteration of the for loop, which gives us the number of requests per second.
Throughput = 1 request / 100 μs = 10,000 req/s
To calculate the utilization, we divide the amount of time the CPU is doing "real work" (i.e. computing) by the total time per request, which is mostly "busy work" (sending the command, waiting out the device latency, reading the data...).
Utilization = 5 μs / 100 μs = 5%
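For concreteness, here is a rough C-style sketch of what this polling (programmed I/O) loop might look like against memory-mapped device registers. The register addresses, command value, and register layout are invented for illustration and are not from any real controller; the timing comments mirror the example device above.

#include <stdint.h>

/* Hypothetical memory-mapped device registers (addresses made up). */
#define CMD_REG    ((volatile uint8_t *)0x1000)
#define STATUS_REG ((volatile uint8_t *)0x1001)
#define DATA_REG   ((volatile uint8_t *)0x1002)
#define READY_BIT  0x01

void compute(const char *buf);    /* the 5 μs of "real work" */

void pio_read_loop(void)
{
    char buf[40];
    for (;;) {
        *CMD_REG = 0x20;                        /* send read command        ~5 μs  */
        while (!(*STATUS_REG & READY_BIT))      /* spin through device latency ~50 μs */
            ;
        for (int i = 0; i < 40; i++)            /* CPU copies each byte itself ~40 μs */
            buf[i] = *DATA_REG;
        compute(buf);                           /* compute                   ~5 μs  */
    }
}

Note how the CPU is occupied for the full 100 μs of each iteration even though only 5 μs of it is real work, which is where the 5% utilization comes from.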
Batching minimizes overhead by using bigger blocks of data, and attacks device latency.
Through batching, we read n sectors at a time, where n can be any number such as 1, 8, 16, 1024 or higher. Reading a large number of sectors at a time can lead to great performance; however, there is a tradeoff with internal fragmentation and external latency. A potential fix to this problem is to read varying-size blocks depending on the operation. This, however, can lead to higher complexity. Most of the time, we read in batches of 8 or 16 sectors.
Latency (for a batch of 21 requests) = send command + device latency + read (21 × 40 bytes) + compute (21 × 5 μs)
Latency = 1000 μs = 5 μs + 50 μs + 840 μs + 105 μs
Throughput = 21 requests / 1 ms = 21,000 req/s
Utilization = 105 μs / 1000 μs = 10.5%
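The sketch below shows how these numbers scale with batch size, assuming the same per-step costs as the example device above: 5 μs to send the command, 50 μs of device latency, 40 μs to read each request's 40 bytes, and 5 μs of computation per request.

#include <stdio.h>

/* Per-batch cost model (all times in μs); the constants mirror the
 * example device above and are assumptions of this sketch. */
static void batch_stats(int n)
{
    double latency     = 5 + 50 + 40.0 * n + 5.0 * n;    /* command + device + read + compute */
    double throughput  = n / latency * 1e6;               /* requests per second               */
    double utilization = (5.0 * n) / latency * 100;       /* share of time doing real work     */
    printf("n=%4d  latency=%7.0f us  throughput=%8.0f req/s  utilization=%5.1f%%\n",
           n, latency, throughput, utilization);
}

int main(void)
{
    batch_stats(1);      /* the unbatched polling case from before   */
    batch_stats(21);     /* the batch used in the numbers above      */
    batch_stats(1024);   /* bigger batches give diminishing returns  */
    return 0;
}

Running this reproduces the 10,000 req/s and 21,000 req/s figures above, and shows that very large batches only inch utilization upward while the latency per batch grows enormously, which is the fragmentation/latency tradeoff mentioned earlier.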
Multitasking lets someone else run while we are waiting.
By allowing other threads to run while the current thread is waiting on the device, multitasking minimizes the amount of time the CPU sits around idly waiting. This reduces the effective time per request used when calculating throughput, and subsequently raises utilization.
for (;;) {
    send command to controller      // send: 5 μs
    do {
        block until interrupt       // block: 50 μs (not counted; another thread runs)
        handle interrupt            // handle: 5 μs
    } while (!ready)                // check: 1 μs
    read buffer (40 bytes)          // read: 40 μs
    compute                         // compute: 5 μs
}
total latency: 106 μs
Since the 50 μs of blocking is overlapped with other threads' work, the CPU time per request is send + handle + check + read + compute = 5 + 5 + 1 + 40 + 5 = 56 μs.
Throughput = 1 request / 56 μs ≈ 17,857 req/s
Utilization = 5 μs / 56 μs ≈ 8.9%
Direct memory access (DMA) improves performance (tells device controller where to put data)
DMA is a method that tackles the PIO bottleneck by letting the device controller transfer data into memory directly, so the CPU no longer has to copy the data itself; we have seen that this copy (the 40 μs read) makes up a large share of the latency of the methods above.
While it removes the expensive per-byte overhead of PIO, DMA introduces overhead associated with the interrupt handler. Therefore, if you have many I/Os, the overhead from the interrupts can stack up and become the bottleneck.
Latency = block (not counted in the throughput calculation) + handle + check + compute
Latency = 61 μs = 50 μs + 5 μs + 1 μs + 5 μs
Throughput = 1 request / 11 μs (handle + check + compute) ≈ 91,000 req/s
Utilization = 5 μs / 11 μs ≈ 45%
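Roughly speaking, "telling the device controller where to put data" means handing it a descriptor that names a memory buffer. The descriptor format, register addresses, and control bit below are entirely invented for illustration; every real controller defines its own.

#include <stdint.h>

/* Hypothetical DMA descriptor: where in memory the data should land. */
struct dma_desc {
    uint64_t buf_phys_addr;   /* physical address of the destination buffer */
    uint32_t length;          /* number of bytes to transfer                */
};

#define DESC_REG  ((volatile uint64_t *)0x2000)   /* invented register addresses */
#define CTRL_REG  ((volatile uint32_t *)0x2008)
#define CTRL_GO   0x1

/* Start a transfer; the controller copies the data into memory on its own
 * and raises an interrupt when it is done, so the CPU never touches the bytes. */
static void start_dma_read(struct dma_desc *d)
{
    *DESC_REG = (uint64_t)(uintptr_t)d;   /* point the controller at the descriptor */
    *CTRL_REG = CTRL_GO;                  /* kick off the transfer                  */
}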
In order to tackle the interrupt-handler overhead of DMA, we now introduce a new method that combines DMA and polling. In DMA + polling, we do not use interrupts. Instead, after issuing our command to the disk controller and doing our computation, we poll to see whether the disk controller is done and ready. If the polling and the disk controller's timing line up well, we can further increase throughput and utilization by throwing out the interrupt overhead while only introducing the trivial busy-wait that comes with polling.
for (;;) {
    while (DMA slots busy)
        yield()
}
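The fragment above shows only the waiting part; one full iteration, with hypothetical helper names and the timings assumed below, might look roughly like this:

for (;;) {
    start_dma_read(buf, 40)     // controller fills buf on its own; no interrupt armed
    while (DMA slots busy)      // check DMA status: <1 μs per check
        yield()                 // let another thread run during the ~50 μs wait
    compute(buf)                // 5 μs of real work
}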
With the above DMA + Polling approach, we get the following latency, throughput, and utilization:
Latency = yield + check DMA + compute
Latency = 56 μs = 50 μs + <1 μs + 5 μs
Throughput = 1 request / 6 μs (check + compute) ≈ 166,667 req/s
Utilization = 5 μs / 6 μs ≈ 84%
Method           Latency (μs)   Throughput (kreq/s)   Utilization (%)
Polling               100              10                   5
Batching             1000              21                  10.5
Interrupts            106              18                   8.9
DMA                    61              91                  45
DMA + polling          56             167                  84
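As a sanity check, here is a short sketch that recomputes the table from the timings assumed in this example: 5 μs of real work per request, and the per-request cycle times worked out above. The printed figures match the table up to rounding.

#include <stdio.h>

struct method {
    const char *name;
    double latency_us;   /* time from request to completion               */
    double period_us;    /* CPU-side time consumed per completed request  */
};

int main(void)
{
    /* Periods: polling 100, batching 1000/21, interrupts 56, DMA 11, DMA + polling 6. */
    struct method m[] = {
        { "Polling",       100,  100         },
        { "Batching",      1000, 1000.0 / 21 },
        { "Interrupts",    106,  56          },
        { "DMA",           61,   11          },
        { "DMA + polling", 56,   6           },
    };
    for (size_t i = 0; i < sizeof(m) / sizeof(m[0]); i++) {
        double throughput  = 1e6 / m[i].period_us;          /* requests per second */
        double utilization = 100.0 * 5.0 / m[i].period_us;  /* 5 μs of real work   */
        printf("%-14s latency=%6.0f us  throughput=%7.0f req/s  util=%5.1f%%\n",
               m[i].name, m[i].latency_us, throughput, utilization);
    }
    return 0;
}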
It seems like DMA with polling gives the best performance by far. However, it may not always be the best choice. The reason it did so well in this example is that the workload was very regular and well synchronized. For a general-purpose operating system such as Linux, there is no guarantee of this regularity, so it will not work as well there. For some embedded systems with more regular workloads, though, it might still be the best choice.