Scribe Notes Lecture 13 (Winter 2016)

by Michael Xiong, Wenlong Xiong, Eric Yuan, and Karen Zhang

Table of Contents

  1. Gzip Bug
  2. File System Implementation
  3. Performance Issues with Filesystems
  4. Input/Output Metrics
  5. Improving File System Performance
  6. Scheduling Algorithms
  7. File System Invariants

Gzip Bug

The gzip program was written by Jean-loup Gailly and Mark Adler, who according to Professor Eggert, didn't know jack about programming. But they did know about compression (also they pulled a disappearing act after they finished the program). So Professor Eggert has been maintaining this program for the past couple of decades. Recently, he received a bug report for gzip that claimed gzip would, upon crash, delete the file to be compressed, and fail to output a compressed *.gz file (resulting in user data loss).

The command gzip foo compresses a file foo into another called foo.gz, and removes the original file. The outline of gzip and the system calls it uses is as follows:


			1   int ifd = open("foo", O_RDONLY);				//input file descriptor 
			2   int ofd = open("foo.gz", O_WRONLY | O_CREAT | O_EXCL, 0666);	//output file descriptor 
			3   while input file is not completely read:	//check for I/O errors; stop and delete foo.gz if one occurs 
			4       read(ifd, buf, bufsize); 
			5       compression(); 
			6       write(ofd, buf, compressed_size); 
			7   if (close(ifd) != 0) error(); 
			8   if (close(ofd) != 0) error(); 
			9   if (unlink("foo") != 0) error(); 
			10  exit(0); 
		

In the program, if the system crashes between lines 9 and 10 (after the unlink but before the exit), we can lose all data: the original foo is gone, and the compressed foo.gz may never have reached the disk. We do all of the correct checks, yet according to Eggert, this bug is common in many Linux programs. To understand it, we must first understand how file systems are implemented, and we will come back to the bug at the end of this lecture.

File System Implementation

If we recall, the BSD file system layout can be thought of as a disk separated into data blocks, which are grouped into sectors with different uses. The sectors are as follows:

  1. Boot Sector
    This contains code to be run at boot time
  2. Super Block
    The super block contains metadata about the entire file system, including the file system type, size, status, and various other metadata. Usually there exist multiple copies of the super block in different parts of the disk, acting as backups
  3. Free Block Bitmap
    This contains a mapping of all the free/used data blocks, at a ratio of 1 bit per data block. (Explained below)
  4. Inodes
    The inode table holds one inode per file. Each inode stores a file's metadata (owner, permissions, timestamps, size) and the pointers to its data blocks.
  5. Data Blocks
    This is the bulk of the filesystem. Data blocks (usually 8 KiB) hold the actual file data.

One concern is how to allocate extra space for a file quickly. When we create a file, we give it an initial size that may be exceeded as we write to it. If we want to expand a file beyond its current size, we need to find a free data block. The naive way to find a free block is to search linearly through the inode table and follow each inode's data pointers to find every block that's in use. Then, we subtract the occupied data blocks from the whole disk to find all the free blocks. The problem with this approach is that it is SLOW. We need to search through a large portion of our file system just so we can add a single block of data to a file. A much better way makes use of the free block bitmap.

The free block bitmap is simply a bitmap with one bit per block of the file system. If a block is free, its bit in the bitmap is set to 1; if it is in use, the bit is set to 0. With the BSD implementation's 8 KiB blocks, this costs only one bit per 8 KiB of disk.

Now, we simply traverse the free block bitmap until we find a bit set to 1. We then know that block is free, and can easily give that block to the file that needs to be grown. Bonus points for this solution since we can plausibly cache the entire bitmap in RAM, rather than make expensive read calls to the device.
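This lookup can be sketched as follows (the byte-array layout and helper name are illustrative assumptions; a real implementation scans cached bitmap blocks), using the convention above that a 1 bit means free:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Sketch: find a free block by scanning the free block bitmap.
   Convention from above: bit = 1 means the block is free, 0 means in use.
   Returns the first free block number, or -1 if the disk is full. */
long find_free_block(const uint8_t *bitmap, size_t nblocks)
{
    for (size_t i = 0; i < nblocks; i++)
        if ((bitmap[i / 8] >> (i % 8)) & 1)  /* bit is 1: block i is free */
            return (long)i;
    return -1;  /* no free blocks anywhere */
}
```

Since one bit covers an 8 KiB block, the bitmap occupies only 1/65,536 of the disk, which is why caching it entirely in RAM is plausible.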

Note: The boot sectors, super blocks, inode tables, and the free block bitmap itself are represented in the free block bitmap. These blocks are always in use. If we ever tried to give a file a block from these sectors, bad things happen.

Performance Issues with Filesystems

Files are stored on the disk, so to read a file from disk into RAM, the disk must seek to the correct track, wait for the platter to rotate the right sector under the head, and then transfer the data.

In total, we spend ~18 ms to read a file from disk, the majority of which is overhead. Assuming our machine is clocked at 1 GHz (low estimate of 1 billion instructions per second), we can complete about 18 million instructions in 18 ms. As a result, with filesystems, we would like to minimize the amount of work the disk performs (overhead of seek time and rotational latency).
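The arithmetic above can be checked with a quick helper (purely illustrative; integer math is used to avoid rounding):

```c
#include <assert.h>

/* Back-of-the-envelope: how many instructions a CPU could have executed
   while waiting out one disk read, computed as instructions per
   millisecond times milliseconds of latency. */
long long instructions_lost(long long instructions_per_sec, long long latency_ms)
{
    return instructions_per_sec / 1000 * latency_ms;
}
```

At 1 GHz and 18 ms, that is 18 million instructions per read, which is why file systems work so hard to avoid going to the disk.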

Input/Output Metrics

We'd like to measure how well our file system works, and to do that we need some metrics in order to gauge its efficacy.

  1. Utilization: This is how often your disk drive is doing useful work; what percentage of the time are you reading/writing data, not overhead
  2. Throughput: This is the rate of request completion. If you make 1000 requests and they all are done in a second, then throughput is (1000 operations/sec)
  3. Latency: This is the overall delay between request and response

In general, designs that achieve good latency are not good at throughput - the two are competing objectives, and you must make a tradeoff between them.

Improving File System Performance

Scheduling Algorithms

When the block layer implements a disk I/O scheduling algorithm (such as first come first serve, shortest seek time first, or the elevator algorithm), the file system does not see the underlying workings. These details are invisible to the file system, which sees only the performance gains of the scheduling.
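As one concrete illustration of what the block layer might do, here is a minimal sketch of shortest-seek-time-first (SSTF) scheduling, which serves whichever pending request is closest to the current head position (the queue representation is an assumption for illustration):

```c
#include <stddef.h>
#include <stdlib.h>
#include <assert.h>

/* Sketch: shortest-seek-time-first (SSTF) scheduling.  Given pending
   requests (as block numbers) and the current head position, choose the
   request closest to the head.  Returns its index, or -1 on an empty queue. */
long sstf_pick(const long *requests, size_t n, long head)
{
    long best = -1;
    long best_dist = 0;
    for (size_t i = 0; i < n; i++) {
        long dist = labs(requests[i] - head);
        if (best < 0 || dist < best_dist) {
            best = (long)i;
            best_dist = dist;
        }
    }
    return best;
}
```

SSTF improves throughput over FCFS by minimizing seek distance, but a steady stream of nearby requests can starve a distant one - an instance of the latency/throughput tradeoff from the metrics section.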

As an example of how the file system must order its disk writes, consider the simple system call that renames the file "foo" to "fuzzy":


			1   rename("foo", "fuzzy")
		

This system call works as follows:

  1. The file system looks up the working directory inode (cached)
  2. We find the directory entry of "foo" in the WDI (also cached)
  3. Update the directory entry of "foo"
    1. Overwrite the name "foo" to "fuzzy"
    2. Update the name size from 3 to 5
  4. Because the WDI and directory entry were both updated in the cache, we must also find their actual data blocks on disk and update them there

But wait, what if the directory entry is completely full when holding "foo"? Then there is no room for the extra two characters in "fuzzy", and we must grow the directory. To do so, the process of renaming gets more complex:

  1. Get a free block, courtesy of the free block bitmap.
  2. Write "fuzzy" into the new block and remove "foo" from the old one. Both of these changes have to be written both to cache and to disk.
  3. Since we added a block to the directory, we must also update the directory's inode.

In total, we update four blocks: the directory inode, the original block for the directory entry, the new block for the directory entry, and the free block bitmap. It is important to note that we have to write these blocks in a particular order so that our file system does not accidentally leak free blocks, or worse, delete user data.

The safest order of writes is as follows:

  1. The new free block -- if we crash after writing the free block, we have effectively done nothing (it is still marked as free in the bitmap)
  2. The free block bitmap -- if we crash here, we'll leak disk space as a block is marked as used, but it is not associated with any inode. Therefore, that block will never be freed. Not good, but not as bad as user data loss.
  3. The directory's inode -- once we associate the block with the inode, we no longer have a free block leak. If we crash here, we'll have two hard links to the same file, both "foo" and "fuzzy". It's not what the user wanted, but it's not the end of the world (or is it? more on this below)
  4. The old block (remove "foo" from the name) -- if we crash here, that's fine

The problem that occurs when we crash between 3 and 4 is the following: If we have 2 hard links to the same file, the link count is not updated, and remains 1. We didn't bother to update it, since we were just doing a rename. But when the user finds both hard links, and deletes one of them, the link count becomes 0, and the system reclaims the data blocks of the file, causing user data loss (this is the end of the world). To solve this, we actually have to update the link count in foo's inode from 1 to 2 before doing any of these dangerous writes, and bring it back down to 1 after they complete. In the case that the system crashes, we don't lose user data, we only have a free space leak, which isn't quite as bad.
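Putting the link-count fix together with the four ordered writes, the whole sequence can be sketched as follows. Here write_block() is a hypothetical stand-in for a synchronous block write; in this sketch it only records the order in which blocks would reach the disk:

```c
#include <assert.h>

/* The blocks touched by the growing-directory rename. */
enum block_id { FILE_INODE, NEW_DIR_BLOCK, BITMAP, DIR_INODE, OLD_DIR_BLOCK };

static enum block_id write_log[8];
static int n_written;

/* Hypothetical synchronous block write: records the write order. */
static void write_block(enum block_id b) { write_log[n_written++] = b; }

void rename_with_grow(void)
{
    write_block(FILE_INODE);     /* 0. bump foo's link count 1 -> 2 first   */
    write_block(NEW_DIR_BLOCK);  /* 1. write "fuzzy" into the new block     */
    write_block(BITMAP);         /* 2. mark the new block as used           */
    write_block(DIR_INODE);      /* 3. point the directory at the new block */
    write_block(OLD_DIR_BLOCK);  /* 4. remove "foo" from the old block      */
    write_block(FILE_INODE);     /* 5. drop the link count back to 1        */
}
```

A crash after any prefix of these writes leaves at worst a storage leak or an extra hard link with a safely overcounted link count -- never user data loss.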

File System Invariants

When maintaining, writing, or using a file system, it is necessary to keep track of important properties of the disk. File system invariants are assertions about the file system that must always hold, even if we crash.

BSD File System Invariants:

  1. Every block is used for one purpose -- Each block in the file system must be used for one and only one of the sector types. For example, a block used for the bitmap may not be used to store data at a later time.
  2. All referenced blocks are appropriately initialized -- any block marked non-free in the bitmap must be initialized (e.g., zeroed) before use. If it is not, stale data left over from the block's previous use could be interpreted as current contents. This does not apply to free blocks, as the bitmap indicates those blocks are not in use and their data is not valid.
  3. All referenced data blocks are marked as used in the bitmap. If a data block can be reached through an inode, then it is in use and should be marked as such in the bitmap.
  4. All unreferenced data blocks are marked free in the bitmap. This is the converse of the above.

Consequences of violating the above invariants:

  1. Violating invariant 1 above causes a variety of errors, from user data loss (if a data block was overwritten), to file system crash (if the boot sector was overwritten). Really, anything can happen depending on what was overwritten.
  2. A referenced block that isn't properly initialized could be mistaken for a valid file, and its stale contents could point to arbitrary places in the file system. This could lead a program to modify a disk address that it should not, causing user data loss.
  3. If a referenced data block is not marked as used in the bitmap, that block may be handed out again when new storage must be allocated for a file. That block will then be referenced by two different inodes. This also leads to user data loss, as modifications through one inode will overwrite the data of the other.
  4. If the last invariant is violated, it will not cause data loss, but instead a storage leak. This is not as catastrophic an error as the previous three violations, but it is still a nuisance for the file system, as blocks will be "lost". We will now talk about how to fix these leaks.

As we have seen, there are many avenues for storage leaks in the event of a system crash. In order to fix these leaks and reclaim unreferenced data blocks, we use the fsck (File System Checker) program. It bypasses the file system entirely, reading and writing the raw disk directly, in order to avoid the protections of the file system.

fsck finds:

  1. Blocks marked as used in the bitmap that are referenced by no inode (leaked blocks)
  2. Inodes that are reachable from no directory entry (orphaned files)

In the first case, fsck simply corrects the leak by appropriately setting the free block bitmap. In the second, fsck takes the orphaned files and places them in the /usr/lost+found/ directory. As these files no longer have parent directories, their names are lost; instead, fsck uses their inode numbers as filenames and waits for the user to take action. Finally, ownership of these files is given to root, as their permissions may no longer be trustworthy after the files were orphaned.
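The heart of the check for invariants 3 and 4 -- the bitmap must agree exactly with reachability from the inodes -- can be sketched as follows (the in-memory arrays are assumptions for illustration; fsck derives the referenced set by walking every inode's block pointers):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Sketch: cross-check invariants 3 and 4.  referenced[i] is nonzero if some
   inode points at block i.  Bitmap convention from above: 1 = free, 0 = used.
   Returns the number of blocks where the bitmap and the inodes disagree. */
size_t count_violations(const uint8_t *bitmap, const uint8_t *referenced,
                        size_t nblocks)
{
    size_t bad = 0;
    for (size_t i = 0; i < nblocks; i++) {
        int is_free = (bitmap[i / 8] >> (i % 8)) & 1;
        if (referenced[i] && is_free)
            bad++;  /* invariant 3 violated: referenced block marked free */
        else if (!referenced[i] && !is_free)
            bad++;  /* invariant 4 violated: leaked block */
    }
    return bad;
}
```

A violation of invariant 4 is repaired by flipping the block's bit back to free; a violation of invariant 3 is more dangerous and requires deciding which inode truly owns the block.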

Back to the bug in gzip:

Although the file system code reads and writes blocks in a safe order, the block layer does not respect any of these requirements. All reads and writes have no order guarantee at the block layer, so although the file system carefully dictates the order in which writes must occur, the block layer is free to (and probably will) write in any order to increase performance.

If our code crashes after the final unlink but before the exit() call, then both the last write to foo.gz and the unlink() may still be sitting in the block layer's request queue. These two requests can be served in either order, and if the unlink reaches disk first, followed by a crash, foo.gz will be deleted (since it is incomplete) and foo will already have been unlinked. This results in total data loss.

In order to work around this, the file system can be configured with the "ordered mount" option, which will insist that the block layer not use the request queue, and instead serve requests in a strict FCFS order.

Alternatively, the code can be modified by adding the following code after each write:


			1   if (fsync(ofd) != 0)	//fsync insists that the write to ofd 
			2       error();		//be completed before moving on 
		

The tradeoff with either of these solutions is that it is slow. Very slow. We trade a lot of performance for consistency. But that is the cost of protecting user data. This fix has already been implemented in the gzip source code.

In fact, this bug is so pervasive that it affects all sorts of applications across all operating systems.