By: Brendan Sio, Angel Chen, Eric Oh, Gene Shaffer
A file system is the way an operating system stores and retrieves files. Without a file system, the information on disk would be one gigantic blob of data with no way of telling where each file begins and ends. What we want from a file system:
Disk space – oftentimes, programs and data are too large to fit in RAM; files give us the much larger space of the disk.
Organization – the file system keeps track of files by using names for easy access.
Performance – efficient search and access of files; high accessibility.
Pattern – an efficient way of organizing files.
Reliability – we want our file system to be robust and secure.
To reach optimal performance, we need an efficient pattern for organizing files. Oftentimes the organization and the implementation clash, decreasing efficiency.
Example: we have 120 PB, or 120,000 TB, of information. To help visualize, imagine 200,000 hard drives with 600 GB each. We want a storage method that is fast and also low cost. How do we retrieve information from it?
We cannot attach many disk controllers to one bus, because the bus would become flooded; it would be many times worse than traffic on the 405 during rush hour, and retrieval time would grow toward infinity.
One possible solution is a specialized network with many CPUs and buses, each with a reasonable number of disk controllers attached, along with the clients.
Such a network of CPUs is the basis of the General Parallel File System (GPFS), a high-performance file system developed by IBM. How is this file system implemented?
Stripes
If a file is gigantic, we don't store the entire file on one disk or on adjacent disks. Rather, we split it into pieces and place the pieces in different locations.
The CPU catalog then maps each file to its various locations and accesses them accordingly.
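As a rough illustration of how striping lets us find data with simple arithmetic (this is only a sketch under assumed parameters, not GPFS's actual layout code; the stripe size and function names are made up):

#include <stdint.h>
#include <stdio.h>

#define STRIPE_SIZE (1 << 20)    /* assumed 1 MiB stripe unit                 */
#define NUM_DISKS   200000UL     /* the 200,000 drives from the example above */

/* Round-robin striping: which disk, which stripe slot on that disk,
   and where inside the stripe does a given file offset live?         */
static void locate(uint64_t file_offset,
                   uint64_t *disk, uint64_t *slot, uint64_t *within)
{
    uint64_t stripe = file_offset / STRIPE_SIZE;
    *disk   = stripe % NUM_DISKS;         /* spread stripes across disks */
    *slot   = stripe / NUM_DISKS;         /* which stripe on that disk   */
    *within = file_offset % STRIPE_SIZE;  /* byte offset inside stripe   */
}

int main(void)
{
    uint64_t disk, slot, within;
    locate(5ULL << 40, &disk, &slot, &within);    /* some 5 TiB offset */
    printf("disk %llu, stripe slot %llu, offset %llu\n",
           (unsigned long long)disk, (unsigned long long)slot,
           (unsigned long long)within);
    return 0;
}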
However, this can cause a bottleneck if many clients try to consult the CPU catalog at the same time.
To solve this problem, we use:
Distributed Metadata
Instead of using a central catalog, we stripe the catalog itself across nodes, which fixes the bottleneck issue.
Metadata is data that contains information about a file, such as the file's location, owner, and timestamps. Distributed metadata spreads this across many nodes in the file system, with each node holding part of the metadata. There isn't a single "control" node, but rather a cloud of nodes. This allows metadata to be accessed easily and quickly.
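To make "metadata" concrete, here is a minimal sketch of the kind of per-file record a metadata node might hold; the field names are illustrative and are not GPFS's actual format:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct file_metadata {
    uint64_t first_block;   /* where the file's data (or first stripe) lives */
    uint32_t owner_uid;     /* owner                                         */
    uint64_t size_bytes;    /* file size                                     */
    time_t   mtime;         /* timestamp of last modification                */
    uint32_t home_node;     /* which metadata node is responsible for it     */
};

int main(void)
{
    struct file_metadata m = { .first_block = 42, .owner_uid = 1000,
                               .size_bytes = 4096, .mtime = time(NULL),
                               .home_node = 7 };
    printf("%llu-byte file owned by uid %u, served by metadata node %u\n",
           (unsigned long long)m.size_bytes,
           (unsigned)m.owner_uid, (unsigned)m.home_node);
    return 0;
}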
Efficient Directory Indexing
The directory is a structure that holds many files. When searching for a file, an O(n) lookup, where n is the number of directory entries, is unacceptable: it would be too slow. Ideally we would want O(1). But how could we implement this efficiently?
Possible structures raised in discussion were a hash table, a balanced binary search tree, and a 2-3 tree.
Using a hash table would require us to know the number of files in advance in order to size the table, giving the user no flexibility.
A balanced binary search tree is not a good option either: file names arrive in essentially random order, and keeping that entropy sorted in a BST is more work than it is worth. There are also many other structures that may or may not be effective.
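To make the O(1) idea concrete, here is a toy chained hash table for directory entries; the bucket count is fixed, which is exactly the inflexibility noted above, and all names here are invented for the sketch:

#include <stdio.h>
#include <string.h>

#define NBUCKETS 1024            /* fixed size: must be chosen up front */

struct dir_entry {
    char name[256];
    unsigned long inode;
    struct dir_entry *next;      /* chaining handles collisions */
};

static struct dir_entry *buckets[NBUCKETS];

static unsigned hash_name(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Average-case O(1): hash the name, walk one (usually short) chain. */
static struct dir_entry *dir_lookup(const char *name)
{
    for (struct dir_entry *e = buckets[hash_name(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

int main(void)
{
    static struct dir_entry hello = { "hello.txt", 1234, NULL };
    buckets[hash_name(hello.name)] = &hello;
    struct dir_entry *e = dir_lookup("hello.txt");
    if (e)
        printf("%s -> inode %lu\n", e->name, e->inode);
    return 0;
}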
Distributed Locking
Distributed locking uses an arbitrarily named lock that only one application can own at a time. It can be used to coordinate access to shared files; however, it is harder to implement.
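As a crude illustration of the one-owner-at-a-time idea (real distributed lock managers are far more involved; the lock-file path here is hypothetical), a lock can be approximated by exclusively creating a file on a shared file system:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *lock = "/tmp/shared-file.lock";          /* hypothetical lock name */
    int fd = open(lock, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0) {
        perror("could not acquire lock (someone else may own it)");
        return 1;
    }
    /* ... critical section: access the shared file here ... */
    close(fd);
    unlink(lock);                                        /* release the lock */
    return 0;
}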
Partition Awareness
Sometimes data is partitioned, or split, and we need to be aware of how we partition our files. When a system has network partition awareness, the largest partition has priority and stays live; this partition has both read and write privileges. Conversely, the smaller partitions are "frozen" and have only read privileges. If two partitions were looking at the same file and both tried to edit it, network partition awareness guarantees that the file will not become corrupted.
File system must stay live during maintenance
There are times when users are unable to connect to a server, such as the SEASNet Linux server, because maintenance is being done on it. If the server could stay live during maintenance, this would be more convenient for users; however, it would be harder to implement, because some user actions taken during maintenance could be dangerous for the server or file system.
When designing a file system organization, we must take the storage hierarchy into account. In Linux, the root directory is the parent of all files; its children include the home directory, Documents, and so on.
When choosing a data structure for (presumably) the CPU catalog, hash tables and binary search trees come to mind. A BST does not take the storage hierarchy into account and is not recommended.
Depth of memory access - how far we go into memory to get the information. The deeper the memory, the longer the turnaround time to access a file.
Throughput is how many bytes per second can be transferred in I/O. It determines the load time of copying data from a file into RAM.
Latency is the turnaround time for one request. We can think of it in terms of this question: "If a request is sent to the file system, how long before we get a response?"
The waterbed effect shows up here: push one metric down and the other pops up, so techniques that improve throughput tend to worsen latency, and vice versa. Throughput is to yin as latency is to yang.
Performance Implications of a Disk (5,400-15,000 rpm)
Note: Steps 2-4 have to be done sequentially; they cannot be done in parallel.
Polling:
Action | Time (in μs) |
Send Command to Device | 5 |
Device Latency | 50 |
Read Latency | 40 |
Computation (useful work) | 5 |
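Following the same accounting used for the other methods below, polling works out to:
latency: send command 5 µs + device latency 50 µs + read 40 µs + compute 5 µs = 100 µs
throughput: 1/100 µs = 10,000 req/s
utilization: 5 µs / 100 µs = 5%
(These are the Polling numbers in the summary table at the end.)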
Batching: increase the buffer size and read bytes in big blocks.
char buf[840];       // was char buf[40]; enlarge the buffer
read(fd, buf, 840);  // read 840 bytes per request instead of 40 (840 = 21*40; fd is some open file descriptor)
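A minimal runnable sketch of the same idea (reading from standard input; process_record() is a hypothetical stand-in for the 5 µs of useful work per 40-byte record):

#include <unistd.h>

static void process_record(const char *rec) { (void)rec; }   /* placeholder work */

int main(void)
{
    char buf[840];                                    /* 21 records of 40 bytes  */
    ssize_t n = read(STDIN_FILENO, buf, sizeof buf);  /* one command, one wait   */
    for (ssize_t i = 0; i + 40 <= n; i += 40)
        process_record(buf + i);                      /* 21 * 5 µs of useful work */
    return 0;
}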
Action | Time (in μs) |
Send Command to Device | 5 |
Device Latency | 50 |
Read Latency | 840 |
Computation (useful work) | 105 |
latency: send command 5 µs + device latency 50 µs + read 840 µs + compute 105 µs = 1000 µs
throughput = 21 requests / 1000 µs = 21,000 requests per second (up by a factor of 2.1 over polling's 10,000 req/s)
We saved time because we only send command to device once and account for device latency once.
utilization: 105 µs /1000 µs = 10.5%
*Batching is an attack on the device latency
Multitasking (interrupt-driven I/O): while we wait for the device to be ready, let someone else run (under the assumption that there are multiple tasks to be done).
This only improves performance when the previous assumption is true.
Action | Time (in μs) |
Send Command to Device | 5 |
Handling Interrupt | 5 |
Check if Ready | 1 |
Read Latency | 40 |
Computation (useful work) | 5 |
latency: send 5 µs + block 50 µs + handle interrupt 5 µs + check if ready 1 µs + read 40 µs + compute 5 µs = 106 µs (6 µs added over polling because of handle and check)
throughput: 1/56 µs = 17,857 req/s (assumes 50 µs from block is used to do useful work)
utilization: 5/56 = 8.93%
DMA To Improve Performance: the CPU tells the device controller where in RAM to put the data.
DMA is an attack on the read time of 40 µs.
*shrinks the 40 µs: the CPU no longer spends time on read instructions, since the controller transfers the data directly into RAM and the CPU just reads it from there.
Action | Time (in μs) |
Block Latency | 50 |
Handling Interrupt | 5 |
Check if Ready | 1 |
Computation (useful work) | 5 |
latency: block 50 µs + handle 5 µs + check if ready 1 µs + compute 5 µs = 61 µs
(Now the interrupt handling is the biggest bottleneck in the DMA method, so we will try to get rid of interrupts by changing the program not to use them at all.)
throughput: 1/11 µs = 91,000 req/s
utilization: 5/11 = 45%
DMA with Polling:
Action | Time (in μs) |
Yield() | 50 |
Check DMA Slots | 1 |
Computation (useful work) | 5 |
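Following the same accounting as above (the 50 µs spent in yield() goes to other useful work):
latency: yield 50 µs + check DMA slots 1 µs + compute 5 µs = 56 µs
throughput: 1/6 µs ≈ 167,000 req/s
utilization: 5/6 ≈ 84%
(These are the DMA w/polling numbers in the summary table below.)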
Method | Latency (μs) | Throughput (K req/s) | Utilization |
Polling | 100 | 10 | 5% |
Batching | 1000 | 21 | 10.5% |
Interrupts | 106 | 18 | 8.9% |
DMA | 61 | 91 | 45% |
DMA w/polling | 56 | 167 | 84% |
DMA with polling gives the best result here; however, that is because the jobs are all the same size and arrive in an assembly-line manner: the problem is synchronized and the workflow is regular.
For a general-purpose system like Linux, it is not the best choice, because requests arrive randomly and there is no such regularity.
It is most useful for specialized embedded programs.