General Parallel File System (GPFS)
Suppose we have a system that uses 120 PB of storage. We don't want to use a single hard drive because that would be super slow. So, let's use 200,000 drives, each holding 600 GB. Hard drives nowadays easily top 1 to 2 TB of storage, so why 600 GB? There are two main reasons: smaller hard drives can be read more quickly, and this system is several years old.
This raises the question, though: how do we manage 200,000 hard drives? We could use a single CPU attached to all of these drives via a single bus, but this would lead to horrible performance bottlenecks. Instead, we'll use the General Parallel File System (GPFS) developed by IBM. GPFS uses a network of computers, each of which is hooked up to a subset of hard drives, to efficiently parallelize I/O.
Features of GPFS
- Striping. One clever optimization GPFS uses has been around since the 1960s: striping. The idea is that we want to maximize disk throughput (the amount of data flowing into or out of the disk per unit of time) without messing around with hard drive hardware. We can do this by cutting a single file into several chunks and then placing each chunk on a separate disk. When we want to read a file, we just query the drives of interest and combine the chunks that we receive (a small sketch of this appears after this list). Though we get better throughput, we do have to be wary about using striping. Suppose, for example, that we have a file spread out across ten drives and one of them goes down; we now have ten points of potential failure, though we can mitigate this problem by creating redundant copies of our data.
- Distributed Metadata. Metadata is data about other data; in terms of file systems, it is data about files other than the files' actual contents. Many existing file systems, such as Hadoop's HDFS, have centralized metadata storage: to find out where a file is, you consult a single CPU, which can lead to bottlenecks. In contrast, GPFS lets us ask multiple CPUs about where files are.
- Partition Awareness. Suppose part of our network goes down; we still want to be able to access our data. GPFS accomplishes this by creating redundant copies of data throughout the network.
- Distributed Locking. When working with very large systems, we need to make sure that we serialize read, write, and other resource requests. GPFS uses distributed locking to manage lock requests and does so efficiently using clever optimizations such as locking only part of a resource.
- Efficient Directory Indexing. Suppose a directory has tons of files - sequentially searching for a file is going to be slow. GPFS achieves sublinear lookup times by using performant data structures.
- File System Maintenance. Over time, file systems build up inconsistencies. It's a good idea to clean out the junk on a semi-regular basis, but we don't want our system to go offline during this time. GPFS handles this for us and lets us perform maintenance without having to take the entire network down.
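To make the striping idea concrete, here is a minimal sketch of round-robin striping in C. Everything in it is assumed for illustration (the disk count, the chunk size, and the disk_write helper are made up); it is not GPFS's actual implementation.

```c
#include <stddef.h>

#define NUM_DISKS  10        /* hypothetical number of disks in the stripe set */
#define CHUNK_SIZE 4096      /* hypothetical stripe unit, in bytes             */

/* Assumed helper: writes `len` bytes at byte offset `off` on disk `disk`. */
void disk_write(int disk, size_t off, const char *buf, size_t len);

/* Write `len` bytes of a file across NUM_DISKS drives, round-robin by chunk.
   Chunk i goes to disk (i % NUM_DISKS), at offset (i / NUM_DISKS) * CHUNK_SIZE
   on that disk. */
void striped_write(const char *data, size_t len)
{
    for (size_t i = 0; i * CHUNK_SIZE < len; i++) {
        size_t off_in_file = i * CHUNK_SIZE;
        size_t n = len - off_in_file;
        if (n > CHUNK_SIZE)
            n = CHUNK_SIZE;
        disk_write(i % NUM_DISKS,                 /* which disk gets this chunk */
                   (i / NUM_DISKS) * CHUNK_SIZE,  /* where on that disk         */
                   data + off_in_file, n);
    }
}
```

Reading works the same way in reverse: chunk i of the file lives on disk i % NUM_DISKS, so ten disks can be serving ten different chunks at once, which is exactly where the throughput win comes from.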
File Systems
All this being said, what actually is a file system?
A file system is an abstract organization of data on secondary storage, which is big, persistent, and slow. It should survive program and OS crashes as well as power outages.
File systems shouldn't be concerned with extremely low-level details, such as disk sector sizes. But that raises the question: how abstract should they be? We need to be able to read, write, read-write, move, allocate, and delete data. Some file systems also support searching data, but this leads to greater complexity because there are so many ways to search through data - for example, most developers can't even agree on a single regex format.
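One way to picture that level of abstraction is as a small table of operations on named files, with no mention of sectors at all. This is only a hypothetical sketch (the fs_ops name and the signatures are made up, not any real API):

```c
#include <stddef.h>
#include <sys/types.h>

/* A hypothetical file-system interface: just the operations listed above,
   with no mention of sectors, platters, or disk geometry. */
struct fs_ops {
    int     (*create) (const char *name, size_t size);            /* allocate */
    int     (*remove) (const char *name);                          /* delete   */
    int     (*rename) (const char *oldname, const char *newname); /* move     */
    ssize_t (*read)   (const char *name, off_t off, void *buf, size_t len);
    ssize_t (*write)  (const char *name, off_t off, const void *buf, size_t len);
};
```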
Professor Eggert's File System
When Professor Eggert was an undergraduate student, his professor was developing a competitor to Unix and tasked him with creating the file system for it. These are the machine and OS specifications Professor Eggert was given:
- 16 KiB RAM
- 700 KiB disk (consisting of 512-byte sectors)
- 16-bit Operating System
Keep in mind that Professor Eggert had no idea how to write a file system at this time - he hadn't taken an operating systems class because... those didn't exist yet! His approach was to divide the disk into two parts:
- A table (arbitrarily made to be 2 sectors long; a rough sketch of an entry appears after this layout), with each entry consisting of a:
- Name (8 bytes)
- Starting Sector (12 bytes)
- Size in bytes (2 bytes)
- Timestamps & other miscellaneous information (4 bytes)
Some things worth noting about the table:
- If an entry's name is shorter than 8 bytes (the allotted quantity for the name field in the table), the convention is to use null bytes to fill the empty bytes. Note that this means that names can be at most 8 bytes long.
- A name field consisting completely of null bytes indicates that the table entry is not in use; inherently, empty file names are not allowed.
- There is space for 26 total files.
- The file data (taking up the remainder of the disk)
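Here is the promised sketch, in C, of what one table entry might look like. Only the field sizes come from the list above; the exact layout and field names are assumptions made for illustration.

```c
#include <stdint.h>

#define SECTOR_SIZE 512

/* One entry of the on-disk table; field widths mirror the list above. */
struct table_entry {
    char     name[8];     /* null-padded name; all null bytes means "unused"  */
    uint8_t  start[12];   /* starting sector of the file's (contiguous) data  */
    uint16_t size;        /* file size in bytes                               */
    uint8_t  misc[4];     /* timestamps and other miscellaneous information   */
};
```

Because every file is stored contiguously, locating its data is pure arithmetic: the file occupies size bytes beginning at byte offset start * SECTOR_SIZE on the disk.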
Now let's analyze some of the pros and cons of the file system that Professor Eggert came up with.
Pros
- It's simple - the design is pretty easy to understand and work with.
- It allows for fast sequential access - because all the data for a file is stored contiguously on disk, the disk arm doesn't have to move much.
- It's predictable - because the table stores information on how big each file is, you know how much the disk arm has to move.
Cons
- The number of files that we can support is limited by the fixed size of the table.
- No support for permissions.
- Pre-allocation is required - when you create a file, you have to specify how big it is for the table entry. You can't resize files in the future because doing so might overwrite blocks next to the file of interest.
- The internal fragmentation is up to 511 bytes.
- Internal fragmentation can be thought of as "wasted space by design." The idea is that we may want to store, say, 200 bytes, but have to allocate 512 total bytes because that's how large our sectors are. At worst, we'll waste 511 bytes if we allocate only one byte (a tiny sketch follows this list).
- The fatal flaw: After using the file system for a while, suppose you've stored and deleted lots of files. Eventually, we may want to allocate a pretty big file - one that spans several sectors, for example. We'll need a contiguous region of free disk space to do this, but we won't be able to find one even though enough free space may be scattered throughout our file system. This problem, termed external fragmentation, is usually dealt with by running a defragmenter, which simply rearranges blocks on disk so that we regain contiguous runs of free space. However, defragmentation is not trivial and is performance intensive.
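As a quick illustration of the internal fragmentation point above, here is a tiny sketch (the function name is made up; the 512-byte sector size comes from the spec):

```c
/* Bytes wasted when a file is rounded up to whole 512-byte sectors.
   A 200-byte file wastes 312 bytes; a 1-byte file hits the worst case, 511. */
unsigned wasted_bytes(unsigned file_size)
{
    unsigned rem = file_size % 512;
    return rem == 0 ? 0 : 512 - rem;
}
```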
Surprisingly, some file systems were built this way: RT-11 (DEC, early 1970s). The developers of RT-11 chose this paradigm because it makes working with real-time systems easier (sequential access is fast and predictable because files are stored in contiguous segments on disk).
FAT File System
Around the same time Professor Eggert was building his file system, engineers at Microsoft were also working on the same problem. They, however, did not want to pre-allocate or run into the above fragmentation issues. So, they came up with the FAT file system, with the following design:
- First, reserve a sector at the beginning of the disk.
- This is used for the boot sector.
- The next section is called the superblock.
- The superblock contains information about the file system: the file system's version number, its size (how many blocks are in the file system?), a count of used blocks, and other global information.
- Next, we have the File Allocation Table (FAT), which helps us keep track of where files start and end via a linked list.
- Each entry in the FAT maps directly to a block on disk; if we have n blocks on disk, then we'll also have n entries in the FAT.
- Each entry also contains the number of the next block to look at when we reach the end of the current block.
- When we're reading a file and reach the end of a block, we can follow this next pointer to find the next block that we need to read from, until we reach an entry whose value is 0 (end of file).
- When we're writing to a file and need a fresh block, we just need to find a free one in the allocation table (look for any entry whose value is -1).
We need at least one root directory in our filesystem - every other file is going to be a descendant of some sort. We can represent files as directory entries, where each entry keeps track of the file's name, where the file starts, and the file's size. Moreover, we can represent directories as files by flipping a few bits that tell us what kind of file we're working with. Regardless, let's say that we pick file i and, by looking at its directory entry, find that it starts at block number j on disk. We can then read from disk, starting at block j until we reach a block whose next value is set to EOF.
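Here is a sketch of that read loop in C. It assumes the FAT has already been loaded into memory as an array fat[], uses these notes' convention that 0 marks end-of-file and -1 marks a free block, and relies on a made-up read_block() helper.

```c
#include <stdint.h>

#define BLOCK_SIZE 4096   /* assumed block size                               */
#define FAT_EOF    0      /* convention in these notes: 0 = end of file       */
#define FAT_FREE  -1      /* -1 = block is free, available for new data       */

extern int16_t fat[];     /* one entry per disk block: the "next block" link  */

/* Assumed helper: copies disk block `blkno` into `buf`. */
void read_block(int blkno, void *buf);

/* Read an entire file whose first block is `start`, following the FAT chain.
   For simplicity, we ignore the fact that the last block may be partly full. */
void read_file(int start, void (*consume)(const void *buf, int len))
{
    char buf[BLOCK_SIZE];
    for (int blk = start; blk != FAT_EOF; blk = fat[blk]) {
        read_block(blk, buf);       /* fetch this block's data   */
        consume(buf, BLOCK_SIZE);   /* hand it off to the caller */
    }
}
```

Growing a file is the mirror image: scan the FAT for an entry whose value is FAT_FREE, point the file's current last block at it, and mark the new block as the end of the chain.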
Let's say we wanted to list all the files in our file system. We'd first find the root directory and list all normal files directly in it. We'd then find all subdirectories and jump to the starting block specified in the directory entry. When we do so, we can do the same thing we did for the root directory - list all normal files, find all subdirectories and jump to their starting blocks, and so on.
[Figure: FAT file system layout. Image courtesy of magic-partition.com.]
Directories are just files!
A directory consists of an array of directory entries, where each entry is composed of a:
- Name
- Type
- Size (in bytes)
- Block number (index of the 1st block of the file)
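A sketch of what one of these entries might look like in C. The field widths and names here are illustrative assumptions, not the real on-disk FAT format.

```c
#include <stdint.h>

/* One directory entry; a directory is just a file holding an array of these. */
struct dir_entry {
    char     name[8];       /* file name                                       */
    uint8_t  type;          /* bits telling us whether this is a regular file
                               or a directory                                  */
    uint32_t size;          /* size in bytes                                   */
    uint16_t first_block;   /* index of the file's first block; from here we
                               follow the FAT's "next" links                   */
};
```

Listing everything, as described above, is then a recursion: read the directory file block by block, print the regular files, and for every entry marked as a directory, jump to its first_block and repeat.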
FAT Pros and Cons
Pros
- No external fragmentation. It doesn't matter if blocks are scattered; the linked list will handle it!
- No pre-allocation needed. Because we don't need contiguous blocks of disk space to hold a single file, files can grow as needed.
- The number of files is not pre-determined by the file system.
Cons
- Sequential access can be slow because file data isn't necessarily stored in contiguous blocks.
- However, defragmentation can help alleviate this problem.
- With FAT, a function like lseek() is O(N). If you wanted to inspect the middle of the file, you would have to traverse the linked list to get to the block with the relevant data.
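A sketch of why that is, reusing the assumed fat[] array and conventions from the earlier read_file sketch: to find the block holding byte offset off, we have no choice but to hop through off / BLOCK_SIZE links one at a time.

```c
#include <stdint.h>

#define BLOCK_SIZE 4096
#define FAT_EOF    0

extern int16_t fat[];   /* the in-memory FAT: one "next block" link per block */

/* Return the disk block holding byte offset `off` of the file starting at
   block `start`, or FAT_EOF if the offset lies past the end of the chain.
   This walks off / BLOCK_SIZE links, hence the O(N) cost of lseek() here. */
int block_for_offset(int start, long off)
{
    int blk = start;
    for (long hops = off / BLOCK_SIZE; hops > 0 && blk != FAT_EOF; hops--)
        blk = fat[blk];
    return blk;
}
```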
Another Note on FAT Files
Let's say you wanted to rename a FAT file within the same directory.
- For example, using "$ mv foo.c bar.c".
This is easy - the FAT file system would simply:
- Find the directory block containing foo.c's entry.
- Load it into RAM.
- Update the entry's name and write the block back to disk.
But say you wanted to move a FAT file between two distinct directories.
- For example, using "$ mv a/foo.c b/foo.c".
This is difficult - the FAT file system would have to add a new directory entry in the destination directory and remove the old entry from the source directory, but several issues can arise if the process is interrupted (say, if the OS crashes). The old entry could still hang around, for example, so the file appears in both directories, or it may even be removed before the new entry has been fully written, losing the file. FAT solves this problem by simply prohibiting renames that move a file between directories.