CS111 Fall 2014 Lecture 10: File System Performance

Scribes: Oliver Huang, Jonathan Chu

File Systems

Things we need from file systems:

  1. Disk space, addresses, organization
  2. Performance
  3. Reliability

Example file system:

Suppose we want a file system that is very large, far too big to fit on a single disk drive.

How can we make a file system for this?

Idea 1: Have disk controllers, each maintaining a certain number of drives (e.g., 16 drives per controller)

[Figure: a set of disk controllers, each managing a group of drives]

Problems: too slow

Idea 2 (better idea):

[Figure: a parallelized design, with many controllers and drives accessed in parallel]

Parallelize the system. This approach works better for us, and is the approach used in the General Parallel File System.

Performance Ideas:

General Parallel File System (GPFS)

GPFS is a good example of how to implement a large file system. The following are performance techniques used in such a system.

Idea 1: Striping

Striping involves splitting a file's data across multiple drives instead of storing it entirely on one drive. This allows a file to be read in parallel from a large number of drives, which speeds up access. Normally, the number of disks is a power of 2.
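As a minimal sketch of the idea (the drive count and helper names here are made up for illustration, not taken from GPFS), logical block i of a file can be mapped round-robin onto the drives, so that consecutive blocks land on different drives and can be read in parallel:

    #define NUM_DRIVES 16    /* example value; a power of 2, as noted above */

    /* Which drive holds logical block i of a striped file, and where on that drive. */
    int stripe_drive(int i)  { return i % NUM_DRIVES; }
    int stripe_offset(int i) { return i / NUM_DRIVES; }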

Idea 2: Distributed Metadata

Distributed Metadata: Metadata is information about a file rather than the file's actual contents: locations, ownership, timestamps, etc., much like a catalog entry. The idea of distributed metadata is that there is no central catalog; the metadata is not all stored in one place but is itself striped across drives.

Idea 3: Efficient Directory Indexing

Efficient Directory Indexing - If a directory contains a large number of files, say over 1 million, we need an efficient data structure to store all of the entries. There are several candidates:

  1. Array - linearly scanning the entries takes O(n) time, which is not acceptable.
  2. Hash table or binary search tree - these work well in RAM, but our file system is much larger than what can fit in RAM, so the performance would not be acceptable.
  3. B+ tree - this is the most feasible. Unlike binary search trees, B+ trees can have a large branching factor, which reduces the number of operations (and disk reads) needed to find an entry; see the worked comparison below.
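As a rough illustration (the branching factor of 200 below is a made-up example value, not a number from the lecture), consider a directory with n = 1,000,000 entries:

    balanced binary search tree:       about log2(10^6) ≈ 20 levels, so roughly 20 node accesses per lookup
    B+ tree with branching factor 200: ceil(log200(10^6)) = 3 levels, so roughly 3 node accesses per lookup

Since each node access can mean a disk read, the B+ tree's shallow depth is what makes lookups feasible at this scale.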

Idea 4: Distributed Locking

Distributed Locking - the file system needs to be able to take locks on a file, even if the file is broken up (distributed) across many drives. This problem is much harder than ordinary single-machine locking.

Idea 5: Partition Awareness

Partition Awareness - Say the file system gets partitioned into two sections; this could happen, for example, during a network failure. A user in one section should still be able to get at files that live in the other, now-separated partition. One solution would be to store copies of the changes or reads made against the other partition; however, this introduces the problem of competing updates once the partitions are reconnected. The solution to this problem is majority rules: only users in the larger partition can make changes, and the smaller side is read-only.

Idea 6: File System Stays Live During Maintenance

File system stays live during maintenance - It is important that the file system keeps running as consistently as possible. One such time is during maintenance; we want users to be able to keep using the file system even then.

GPFS is a big file system though. Let's think about how to build this from the bottom up. What hardware is important for file system components?

[Figure: the hardware/storage hierarchy that file system components live on]

We want to keep these things in mind while building file systems. The file system itself is a data structure that lives at all levels of this hierarchy; it is a sort of hybrid storage structure. We always want to measure price/performance, so the question becomes: how do we measure performance?

File System Performance Metrics

There are three main metrics for file system performance:

  1. Throughput: how many bytes (or requests) per second can you get from the file system?
  2. Latency: if I send one request to the disk, how long will it take to get an answer back?
  3. Utilization: what fraction of the CPU's time is spent doing useful work?
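Written out the way they are used in the calculations later in these notes:

    latency     = time from issuing one request until the answer comes back
    throughput  = requests completed per unit time = 1 / (time spent per request)
    utilization = (time the CPU spends doing useful work) / (total time per request)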

Like all systems, there are always tradeoffs between metrics. The first two, throughput and latency, are competing objectives: efforts to increase throughput tend to increase latency, and efforts to decrease latency tend to decrease throughput. Our objective, then, is to find a sweet spot between throughput and latency.

When we want to access a file on a disk, these are the things that we must do:

  1. Send command to controller
  2. Accelerate disk heads (seek time)
  3. Decelerate disk heads (seek time; the seek time is the slowest part)
  4. Disk controller waits for the disk to rotate to the right sector (rotational latency)
  5. Get data into the disk controller's cache
  6. Copy the data into RAM or to the CPU
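For example, with the hypothetical device timings used in the calculations later in these notes (5 μs to send the command, 50 μs of device latency covering the seek and rotation, and 40 μs to transfer the data), one access costs roughly 5 μs + 50 μs + 40 μs = 95 μs of overhead before the CPU can do anything useful with the data.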

To get a performance measure, we need to know how long this takes. Let's take a look at some example disk drives:

Example Device 1:

Seagate Barracuda ES2 1TB "near line"

Measuring the performance of the Seagate Barracuda (external transfer rate, maximum internal and maximum sustained transfer rates, power) is much more straightforward than measuring its reliability (MTBF, AFR, nonrecoverable read error rate). The failure rates are so low that it is not economically feasible for manufacturers to wait for failures to happen and record the times, so the reliability numbers are often guesswork.

Example Device 2:

Corsair Force GT

The real performance advantage of SSD drives is that there is no seek time.

Methods to improve performance

Let our example device have the following timings: 5 μs to send a command to the controller, 50 μs of device latency, 40 μs to read 40 bytes of data, and 5 μs to compute on those bytes.

Recall the performance metrics for I/O: latency, throughput, and utilization. These are the numbers we want to compute for each of the methods below.

Method 1: Polling

We want to check performance for the simplest case, where there is one read operation.

for (; ;){                                  send command to controller      5μs
    char buf[40]                            device latency                  50μs
    read 40 bytes from device to buf        read data                       40μs
    compute(buf)                            compute                         5μs
}                                           total latency:                  100μs

To calculate the throughput, we divide 1 by the total latency of one iteration of the for loop, which gives us the number of requests per second.

Throughput = 1 request / 100 μs = 10,000 req/s


To calculate the utilization, we divide the amount of time the CPU is doing useful work (i.e., computing) by the total latency of one iteration, most of which is overhead (sending the command, device latency, reading the data).

Utilization = 5 μs / 100 μs = 5%


Method 2: Batching

Batching minimizes overhead by using bigger blocks of data, and attacks device latency.

Through batching, we read n sectors at a time, where n can be 1, 8, 16, 1024, or higher. Reading a large number of sectors at a time can give great performance, but there is a tradeoff: bigger batches mean internal fragmentation and added latency. A potential fix to this problem is to read blocks of varying size depending on the operation; this, however, leads to higher complexity. Most of the time, we read in batches of 8 or 16 sectors.
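As a sketch in the style of the polling loop above (the batch size of 21 records is chosen only so the numbers line up with the latency breakdown below, not because 21 is a typical choice):

for (; ;){                                      send command to controller      5 μs
    char buf[21 * 40];                          device latency                  50 μs
    read 21 * 40 bytes from device to buf       read data                       840 μs
    compute(buf)    /* process all 21 records */ compute                        105 μs
}                                               total latency:                  1000 μs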

Latency = send command + device latency + read + compute
        = 5 μs + 50 μs + 840 μs + 105 μs = 1000 μs

Throughput = 21 requests / 1 ms = 21,000 req/s


Utilization = 105 μs / 1000 μs = 10.5%


Method 3: Multitasking

Multitasking lets someone else run while we are waiting.

By allowing other threads to run while the current thread is waiting for the device, multitasking minimizes the amount of time the CPU sits around idle. The waiting time no longer counts toward the per-request cost when calculating throughput, which raises throughput and, in turn, utilization.

for (; ;) {
	send command to controller			send 			5 μs
	do{ block until interrupt			block 			50 μs (do not consider in calculations)
		handle interrupt			handle			5 μs
	} while (!ready)				check			1 μs
	read buffer (40 bytes)				read 			40 μs
	compute 					compute			5 μs
}							total latency: 		106 μs
Throughput = 1 request / 56 μs ≈ 17,857 req/s


Utilization = 5 μs / 56 μs ≈ 8.9%


Method 4: Direct Memory Access (DMA)

Direct memory access (DMA) improves performance: the CPU tells the device controller where in memory to put the data, and the controller transfers the data there by itself.

DMA is a method that tackles the PIO (programmed I/O) bottleneck: instead of the CPU copying the data out of the controller, the controller writes the data directly into memory. We have seen that this copy accounts for a large share of the latency in the methods above.

While it attacks the expensive PIO overhead, DMA introduces an overhead associated with the interrupt handler. If you have many I/Os, the overhead from the interrupts can stack up and become the new bottleneck.
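The notes do not sketch a loop for plain DMA, so here is one in the style of the earlier loops; the structure is an inference from the latency breakdown that follows (the cost of issuing the command is not listed there, so it is left unannotated):

for (; ;) {
	issue DMA command to controller		(controller writes the data into RAM itself)
	block until interrupt			block			50 μs (not counted toward throughput)
	handle interrupt			handle			5 μs
	check that the transfer is done		check			1 μs
	compute(buf)				compute			5 μs
}						total latency:		61 μs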

Latency = block (not counted in throughput calc.) + handle + check + compute
        = 50 μs + 5 μs + 1 μs + 5 μs = 61 μs

Throughput = 1 request / 11 μs ≈ 91,000 req/s


Utilization = 5 μs / 11 μs ≈ 45%


Method 5: DMA + Polling

In order to tackle the issue of the interrupt handler overhead in DMA, we now introduce a method that combines DMA and polling. In DMA + polling, we do not use interrupts. Instead, after issuing our command to the disk controller and doing our computation, we poll to see whether the disk controller is done and ready. If the polling and the disk controller's timing are well matched, we can further increase throughput and utilization by throwing out the interrupt overhead while introducing only a trivial busy-wait.

for (; ;) {
	issue command to the disk controller
	while (DMA slots busy)
		yield()					yield			50 μs (other threads run)
	check that the device is done			check			<1 μs
	compute(buf)					compute			5 μs
}							total latency:		56 μs

With the above DMA + Polling approach, we get the following latency, throughput, and utilization:

Latency = yield + check DMA + compute
        = 50 μs + <1 μs + 5 μs = 56 μs

Throughput = 1 request / 6 μs ≈ 166,667 req/s

Utilization = 5 μs / 6 μs ≈ 84%

Performance Summary


Method           Latency (μs)   Throughput (K req/s)   Utilization (%)
Polling          100            10                      5
Batching         1000           21                      10.5
Interrupts       106            18                      8.9
DMA              61             91                      45
DMA + polling    56             167                     84

It seems like DMA with polling gives the best performance by far. However, it may not always be the best choice. The reason it did so well in this example is that the workload was very regular and well synchronized. A general-purpose operating system, such as Linux, has no guarantee of such regularity, so DMA with polling will not work as well there; for some embedded systems with more regular workloads, though, it might still be the best choice.