CS 111, Winter 2012 Scribe Notes

Lecture 11: Filesystem Design

by Alexander Chow

Filesystem Performance (continued)

When is prefetching bad?

Prefetching assumes locality of reference: it assumes that your next request will involve data that is next to your last request. The problem is that this is not always true. If prefetched data is never used, then time has been wasted retrieving it, and the RAM holding it is unnecessarily occupied.

Prefetching assumes cache coherence. But consider a computer with multiple CPUs, each with its own local cache (not shared between CPUs). What happens when process 0 running on CPU 0 writes to some file A, while process 1 on CPU 1 simultaneously reads from file A? Each CPU loads A into its cache and performs its reads and writes there, so process 1 may be reading stale data: it may need the most recent contents of A, but CPU 0 has not yet pushed its write to A out of its cache.

What can we do to fix this cache coherence problem? We can add layers! We can create a shared cache in the kernel. This introduces a race condition, but as we learned in previous lectures, this can be solved by placing locks on data.

To some extent, the problem with cache coherence is linked with dallying. Let us examine a problem with dallying. Suppose a process makes a write to some file A and immediately after the write, the system crashes. Did that write go through? Because of dallying, the write is only committed to cache and not to disk. The write that we thought succeeded did not actually do what we expected, and we lost the changes we wanted made.

What are some possible fixes? We can turn off dallying (there are system calls to do this), but that makes things really slow; we introduced dallying precisely so that performance would be better. How about write barriers? There is a sync() system call that flushes all writes to disk. Unfortunately, this is too global a solution: it flushes ALL cached writes, across all processes, when we really only want to flush one file's writes, and we would have to wait for all of them to finish. Lucky for us, there is the system call fsync(fd), which flushes all cached data associated with the file descriptor fd to disk. There is also fdatasync(fd), which is similar to fsync() but flushes only the file's data, not metadata such as the last access time.

Filesystem Design

Recently, IBM announced the production of a 120 petabyte filesystem. How did they go about creating such a monstrous filesystem? First they needed to find the hardware that was both cheap and fast. They found this in a 600 GB hard drive. To reach the commissioned capacity of 120 PB, 200,000 of these drives were needed.

Problems?

A simple filesystem

We can model the physical disk as a big array of 512-byte sectors. All files are stored contiguously, and we reserve sectors at the beginning of the disk for directory listings. These reserved sectors serve as an array of directory entries, each of which describes one file and takes up 256 bytes (2 directory entries per sector). The first bit of a directory entry says whether or not the entry is in use, and the entry stores the start sector, length, name, and timestamp of its file.

The RT-11 filesystem (ca. 1972) used this model.

What are some advantages to this model?

And disadvantages?

The FAT (File Allocation Table) filesystem

The purpose of FAT (ca. late 1970s) is to attack external fragmentation as well as preallocation requirements. FAT introduces a level of indirection for file data, allocating a file's data in 4 KiB blocks so that the data need not be stored contiguously.

There is space at the beginning of the filesystem for the following: a boot sector; a superblock, which records the filesystem's version and size; and the file allocation table itself, an array with one entry per data block, where each entry holds the block number of the next block in that file (or a special value marking end-of-file or a free block).

A directory is a file. The contents of a directory are an array of 16-byte directory entries. The first 11 bytes store the file name in the 8.3 convention (the first 8 bytes hold the name and the last 3 hold the extension, so TEXTFILETXT is actually TEXTFILE.TXT), the next 3 bytes store the file size, and the last 2 bytes hold the number of the file's first block. It is very apparent that these small sizes are limiting by today's standards; these issues are addressed in later revisions of FAT, and this description applies to an early version. The root directory of the filesystem is contained in the superblock.

What are some problems with FAT?

Inodes

An inode is a structure devoted to a particular file. In a Unix filesystem, there is an inode table of fixed size instead of a FAT. An inode contains metadata about a file and an array of block pointers to the file's blocks. Since inodes are a fixed size, files have a maximum size.

A traditional inode contains block pointers as follows:

Berkeley Filesystem

In the Berkeley filesystem (AKA the Unix filesystem), there is a block bitmap before the inode table, containing one bit per block in the filesystem that describes whether or not that particular block is free.

A file is linked directly to an inode number. We can have hard links (different filename but same inode), which requires keeping track of the number of links to an inode so we know when we can reclaim the blocks used by that inode.