CS 111, Winter 2012 Scribe Notes

Lecture 11: Filesystem Design

by Alexander Chow

Filesystem Performance (continued)

When is prefetching bad?

Prefetching assumes locality of reference: it assumes that your next request will involve data that is next to your last request. The problem is that this is not always true. If prefetched data is never used, then time has been wasted retrieving it, and the RAM holding it is unnecessarily occupied.

Prefetching assumes cache coherence. But consider a computer with multiple CPUs, each with its own local cache (not shared between CPUs). What happens when process 0 running on CPU 0 writes to some file A, while process 1 on CPU 1 simultaneously reads from file A? Each CPU loads A into its cache and performs its reads and writes there, so process 1 may be reading stale data: it may need the most recent contents of A, but CPU 0 has not yet pushed its write to A out of its cache.

What can we do to fix this cache coherence problem? We can add layers! We can create a shared cache in the kernel. This introduces a race condition, but as we learned in previous lectures, this can be solved by placing locks on data.

To some extent, the problem with cache coherence is linked with dallying. Let us examine a problem with dallying. Suppose a process makes a write to some file A and immediately after the write, the system crashes. Did that write go through? Because of dallying, the write is only committed to cache and not to disk. The write that we thought succeeded did not actually do what we expected, and we lost the changes we wanted made.

What are some possible fixes? We can turn off dallying (there are system calls to do this), but that makes things really slow; we introduced dallying precisely so that performance would be better. How about write barriers? There is a sync() system call that flushes all writes to disk. Unfortunately, this is too global a solution: it flushes ALL cached writes, across all processes, when we really only want to flush one file's writes, and we would have to wait for all of them to finish. Lucky for us, there is the system call fsync(fd), which flushes all cached data associated with the file descriptor fd to disk. There is also fdatasync(fd), which is similar to fsync() but flushes only the file's data, not metadata such as the last access time.

Filesystem Design

Recently, IBM announced the production of a 120 petabyte filesystem. How did they go about creating such a monstrous filesystem? First they needed to find the hardware that was both cheap and fast. They found this in a 600 GB hard drive. To reach the commissioned capacity of 120 PB, 200,000 of these drives were needed.

Problems?

A simple filesystem

We can model the physical disk as a big array of 512-byte sectors. All files are stored contiguously, and we reserve sectors at the beginning of the disk for directory listings. These reserved sectors serve as an array of directory entries, each of which describes one file and takes up 256 bytes (2 directory entries per sector). The first bit of a directory entry says whether or not the entry is in use, and the entry stores the start sector, length, name, and timestamp of its file.

The RT-11 filesystem (ca. 1972) used this model.

What are some advantages to this model?

And disadvantages?

The FAT (File Allocation Table) filesystem

The purpose of FAT (ca. late 1970s) is to attack external fragmentation as well as preallocation requirements. FAT introduces a level of indirection for file data, allocating a file's data in 4 KiB blocks so that the data need not be stored contiguously.

There is space at the beginning of the filesystem for the following: a boot sector; a superblock, which records the filesystem's version and size; and the file allocation table itself, an array with one entry per data block, where each entry holds the block number of the next block in that file (or a special value marking end-of-file or a free block).

A directory is a file. The contents of a directory are an array of 16-byte directory entries. The first 11 bytes store the file name in the 8.3 convention (the first 8 bytes hold the name and the last 3 hold the extension, so TEXTFILETXT is actually TEXTFILE.TXT), the next 3 bytes store the file size, and the last 2 bytes hold the number of the file's first block. It is very apparent that these small sizes are limiting by today's standards; these issues are addressed in later revisions of FAT, and this description applies to an early version. The root directory of the filesystem is contained in the superblock.

What are some problems with FAT?

Inodes

An inode is a structure devoted to a particular file. In a Unix filesystem, there is an inode table of fixed size instead of a FAT. An inode contains metadata about a file and an array of block pointers to the file's blocks. Since inodes are a fixed size, files have a maximum size.

A traditional inode contains block pointers as follows:

Berkeley Filesystem

In the Berkeley filesystem (AKA the Unix filesystem), there is a block bitmap before the inode table, containing one bit per block in the filesystem that describes whether or not that particular block is free.

A file is linked directly to an inode number. We can have hard links (different filename but same inode), which requires keeping track of the number of links to an inode so we know when we can reclaim the blocks used by that inode.