CS 111: Operating System Principles

Lecture 13: File System Robustness

Scribe: Uen-Tao Wang


 

Table of Contents
1.     Overview of File System Goals
2.     Durability
o    Defining Failure
o    Models for Failure
o    A Simple Model for Power Loss
3.     Assumptions
o    Lampson-Sturgis Assumptions
o    CS 111 Assumptions
4.     Invariants
5.     Additional Ideas for Robust File Systems
o    Commit Record
o    Journaling
 

1. Overview of File System Goals

The previous lecture briefly covered the goals of file systems. These include:

Performance: the file system should make good use of the underlying hardware; operations should have high throughput and low latency despite slow devices.

Durability: data should survive failures (of a limited kind) in the underlying hardware, such as a loss of power.

There is a third goal that was not mentioned in the previous lecture, but it is worth adding here:

Correctness: file system operations should do what they claim; for example, a read should return the data that was most recently written.

 


 

2. Durability

Here is an "underlying system": Prof. Eggert's laptop.

Important note: this laptop is very old.

 

What can go wrong with a system when dealing with durability? What if the keyboard on this old laptop dies? The keyboard is connected to the CPU via a bus, and the bus is also connected to the RAM and the disk controller. Any one of these components could die. Additionally, one could take this laptop and drop it on the floor. Any aspect of this system could fail, but the goal of durability is NOT to survive every possible occurrence. You cannot take a sledgehammer to the laptop and use the resulting debris to claim that durability has not been solved. The key is that we are trying to account for limited failure, rather than arbitrary failure. 
 
In particular, we are most concerned with the failures that are most likely (in a seven-year-old laptop) and the ones that will cause damage. To account for such failures, we want a model that both corresponds to the real world and is simple. The type of model we choose will prove to be crucial, because it will affect our design. For demonstrative purposes, we will propose a simple model based on Prof. Eggert's laptop. Pretend that the laptop is so old that the battery does not work, which means that it must be connected to a power source at all times. In addition, pretend that the only failure that can occur is a loss of power.
 
Based on this model of failure, suppose that we have a file F with contents XXYYZZ and we want to edit it so that it contains XXAYYZZ. Let us assume there is unused space at the end of the data block. At a low level, we will want to do a write_sector. However, suppose that right in the middle of write_sector, we lose power. What is going to happen?
 
 

"This is a good time for some Coke, while you're thinking."

-Prof. Paul Eggert, 2/27/2013

 
 
Such a system will have a power supply that, even once it is no longer plugged into the wall, may hold enough charge to run the computer for a little while. This may allow the operation to complete, or may allow the CPU to trap on the power failure and finish the write cleanly. Will this fix the problem? What actually happens is one of those two things, depending on the system, but we would like a simple model that accounts for both kinds of behavior. Assume that writing "Y" to a sector which currently holds "X" looks like this:
 
Time             0   1   2
Data in sector   X   ?   Y
 
 
During time unit 1, the data in the sector is uncertain. We can assume that if we attempt to read a bad sector, we will know about it. Unfortunately, other I/O devices do not necessarily have this ability, and consequently the data during time 1 could be arbitrary if read by one of those devices. In this system, while we are writing to F, the contents of F are temporarily "completely bogus". This means that if we pull the plug, not only have we lost the write, but we have lost our old data as well. Believe it or not, this will probably be an issue.
 
"Here's how we solve the problem. We solve the problem by throwing more disk space at the problem"
-Prof. Paul Eggert, 2/27/2013
 
We start off with the file F. In our file system, however, we will maintain two copies, F1 and F2. We follow the golden rule of durability, which is "never overwrite our last copy of something". Again assume F has "X" and we want to write "Y". We start by writing Y to F2 first, then to F1. The resulting instructions will look like this: 
 
Time   0   1   2
F1     X   X   Y
F2     ?   Y   Y
 
With this method, at no time have we lost the data: at every instant, either X is sitting in F1 or Y is sitting in F2. We have not lost our data, but we still have a problem. There is no marker telling us whether the current data is in F1 or in F2. We can instead employ a marker file and write to it to indicate whether the write has completed successfully. If the marker file contains 2, the data to use is in file 2; if the marker file contains 1, the data to use is in file 1. The order of operations is thus:
 
F2 = Y;
Marker = 2;
 
Now that there is information confirming that the new data is in F2, we can actually optimize: we can stop here, without ever copying Y back into F1, and complete the write in three steps.
 
Time     0   1   2
F1       X   X   X
F2       ?   Y   Y
Marker   1   1   2
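The two-copies-plus-marker scheme above can be sketched as follows. This is a simulation, not real disk code: each "sector" is reduced to a single char, and the names (durable_write, durable_read) are invented for illustration.

```c
static char f1 = 'X', f2 = 0;  /* the two copies, one sector each      */
static int  marker = 1;        /* names the copy holding the good data */

/* Write by updating the copy the marker does NOT name, then flip the
   marker; the last good copy is never overwritten.                    */
static void durable_write(char newval) {
    if (marker == 1) { f2 = newval; marker = 2; }
    else             { f1 = newval; marker = 1; }
}

/* The marker always names a copy whose write fully completed. */
static char durable_read(void) { return marker == 1 ? f1 : f2; }
```

If power fails before the marker write completes, the marker still names the old copy, which was never touched.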
 
The problem with this? This is heavy-duty overhead. We have effectively solved the problem of writing to one sector by writing to three sectors. Can we speed this up somehow? Can we do it with two writes, maybe by cheating the system?
 
Here is a possible solution. At the end of the sector, we are going to cheat and write a little bit of a timestamp. We can then solve the problem by writing Y to F2 along with the current time, and checking which copy has the more recent timestamp when we want to access the data. This reduces the process to only one write, which is actually too good to be true. Where did we go wrong? In the middle of the write, F2 can be arbitrary data concatenated with an arbitrary timestamp. This is not quite enough to ensure robustness.
 
The classic solution to the problem is actually to maintain three copies, F1, F2, and F3, each with say, A. When you write, you write to all three copies, one at a time. Let's say we want to write "B". 
 
Time   0   1   2   3   4   5   6
F1     A   ?   B   B   B   B   B
F2     A   A   A   ?   B   B   B
F3     A   A   A   A   A   ?   B
 
The advantage of this system lies in how you can safely recover the data. After a crash, the rule is:
 
if (F1 == F2)
    return F1;
else
    return F3;
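A quick way to convince yourself the rule works is to simulate a crash at every time step in the table above. This is a sketch: sector contents are modeled as single chars, with '?' standing for a sector caught mid-write, and the function names are invented.

```c
/* Recovery rule from above: if the first two copies agree, trust them;
   otherwise the third copy is still intact.                            */
static char recover(char f1, char f2, char f3) {
    return (f1 == f2) ? f1 : f3;
}

/* Contents of the three copies if power fails at time t (0..6),
   matching the table: at odd times, one copy is caught mid-write.      */
static char crash_at(int t, char oldv, char newv) {
    char f[3] = { oldv, oldv, oldv };
    for (int i = 0; i < 3; i++) {
        if (t == 2 * i + 1) { f[i] = '?'; break; } /* mid-write: arbitrary */
        if (t >= 2 * i + 2) f[i] = newv;           /* this write completed */
    }
    return recover(f[0], f[1], f[2]);
}
```

At every crash point the rule returns either the complete old value or the complete new value, never garbage.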
 
This system will consistently return the old data for a crash during times 0 through 3, and the new data during times 4 through 6. This sort of solution, while classic, drives people up the wall. We need three copies... of EVERYTHING. You gotta be kidding me. I don't have that kind of money. Even if we were to reduce it to two copies and a flag, the flag will still require a sector anyway. Ideally, we are aiming for a better solution, specifically one that requires fewer than three copies. Because we are not able to solve the problem under our model, we must abandon our underlying model of failure and choose a new one that is more appropriate to our system.
 

3. Assumptions
 
Our previous model was very simple and applies to many I/O devices, but because it made so few assumptions, it was difficult to create a design based on it. We must make more assumptions about how the devices behave. A classic set of assumptions for such a case is:

The Lampson-Sturgis Assumptions (paraphrased):
  1. Storage writes may fail, and a failed write can corrupt another piece of storage; however, a later read can detect the bad block.
  2. A block that was written correctly can spontaneously decay; again, a later read can detect the damage.
  3. Errors are rare, rare enough that repairs can be deferred until it is convenient.

Let's apply these assumptions to a more concrete example, in particular the syscall rename:
 
rename("a", "b")
 
Recall from previous lectures the general structure of BSD file systems. The process will contain an inode pointing to the working directory. This inode will point to data blocks that represent the contents of the directory by means of a series of directory entries. One of the entries has the name of "a". Based on this, rename will work as follows:
  1. Read the working directory inode into memory
  2. Read the data block of the directory into memory
  3. Look through directory entries for "a" and rename to "b" in memory
  4. Write the data block out to disk
  5. Update the "last modified" timestamp in the directory inode in memory
  6. Write the inode to disk.
What can go wrong? Suppose the system crashes between steps 4 and 5. After the reboot, it will be evident that the rename succeeded, but the timestamp will be old. This is probably not a severe error. Suppose instead that the system crashes between steps 3 and 4. It will appear as though the rename did not occur, which again is not a catastrophic error. It appears that under this model, the rename syscall is fairly safe.
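Step 3 above, the in-memory edit, can be sketched as follows. The directory-entry layout here (fixed-size names, a fixed number of slots) is a simplification for illustration, not the real BSD on-disk format, and the function names are invented.

```c
#include <stdbool.h>
#include <string.h>

#define NENTS 8
struct dirent_s { char name[8]; int inum; };  /* simplified entry */

/* Find the entry named `from` in the in-memory data block and rewrite
   its name in place. The caller then writes the block back to disk.   */
static bool rename_in_block(struct dirent_s ents[NENTS],
                            const char *from, const char *to) {
    for (int i = 0; i < NENTS; i++)
        if (strcmp(ents[i].name, from) == 0) {
            strcpy(ents[i].name, to);   /* "b" is no longer than "a" */
            return true;
        }
    return false;
}

/* small demo: rename "a" to "b" in a two-entry block */
static bool demo_rename(void) {
    struct dirent_s ents[NENTS] = { { "a", 1 }, { "c", 2 } };
    return rename_in_block(ents, "a", "b")
        && strcmp(ents[0].name, "b") == 0;
}
```

Note that the in-place rewrite only works because the new name fits in the old entry, which is exactly the issue the next example raises.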
 
However, note that in addressing this syscall, we have made more assumptions than the Lampson-Sturgis assumptions, which means that we effectively have a different storage model. We are making the additional assumption that every write of a data block is an atomic operation. Nowadays, that is not totally unreasonable, so let us instead revise the Lampson-Sturgis assumptions into the CS 111 assumptions, which include all of the previous assumptions plus the assumption that a single data block write is atomic: a write is everything or nothing. Let's revisit rename and suppose that the user wants to cause some trouble with the following command:
 
rename("a", "bb");
 
If we are lucky, there will be space at the end of the directory entry, but it could be that there is no room. At that point, the system would have to scurry around in the data block looking for a hole somewhere else big enough to hold the slightly bigger entry, and then make a copy. This will not work in general, because sometimes the entire data block is full. A reasonable solution is to enforce a fixed maximum file name size, which does occur in practice. Now let us consider:
 
rename("d/a", "e/a")
 
Now we have two different directories with two different inodes. What sequence of actions can make this work? Bear in mind that we want a system whereby, if we fail in the middle of the operation, things still look halfway decent. Assume for now that there is enough space in the second directory for the new entry. The following is a possible method:
  1. Read the directory inodes and their corresponding data into memory
  2. Write a directory entry for "a" into the data for "e", then write that data block to disk
  3. Update the data for "d" (removing the entry for "a") and write it to disk
  4. Write the inode for "e" to disk
  5. Write the inode for "d" to disk
Note that step 2 should always occur before step 3, so that if the system crashes in between, there is always at least one directory entry referring to the file. What's the catch? Notice that after we do step 2, we have technically added a link to file "a", but that link has not yet been recorded in "a"'s inode. This means that if the system crashes after step 2 (leaving two entries but a link count of 1) and we then try to remove what looks like a duplicate entry, the link count drops to zero and the file's blocks are freed while a link to them still exists: we will have lost the data. Here is a revised method that accounts for this.
  1. Read the directory inodes and their corresponding data into memory
  2. Update the link count of inode "a" to 2 and write the inode to disk
  3. Write a directory entry for "a" into the data for "e", then write that data block to disk
  4. Update the data for "d" (removing the entry for "a") and write it to disk
  5. Update the link count of inode "a" to 1 and write the inode to disk
  6. Write the inode for "e" to disk
  7. Write the inode for "d" to disk
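The write ordering above can be sanity-checked with a quick simulation. This is a sketch under heavy assumptions: the disk is reduced to three fields, the timestamp-only writes of steps 6 and 7 are ignored, and all names are invented.

```c
#include <stdbool.h>

struct disk_state { bool in_d, in_e; int links; };

/* On-disk effect of the first k disk writes (0..4) of the revised rename. */
static struct disk_state crash_after(int k) {
    struct disk_state s = { true, false, 1 }; /* "a" lives in d, 1 link */
    if (k >= 1) s.links = 2;    /* step 2: inode "a", link count 2      */
    if (k >= 2) s.in_e = true;  /* step 3: new entry in e's data block  */
    if (k >= 3) s.in_d = false; /* step 4: old entry removed from d     */
    if (k >= 4) s.links = 1;    /* step 5: inode "a", link count 1      */
    return s;
}

/* Safe: the file is always reachable, and the recorded link count never
   undercounts the true number of links (overcounting merely leaks).     */
static bool safe(struct disk_state s) {
    int true_links = (int)s.in_d + (int)s.in_e;
    return true_links >= 1 && s.links >= true_links;
}
```

Every crash point leaves the disk in a safe state; the worst case is an overcounted link, never a lost file.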
This seems to resolve the previous problem, but one issue remains: sometimes the link count will be too high. If we crash between steps 2 and 3, the link count will be 2 but the true number of links will be 1. This causes a storage leak, which is undesirable, but from a user's perspective losing data is a far more severe concern. When writing a file system, you must constantly consider what happens if the system fails at any point in time. The purpose of this example is to demonstrate the importance of choosing an appropriate model for failure, but let's take this to the next level up. In practice, not only do you need a model for how your underlying system can fail, but you also need to consider aspects of the system that must always be true. We need... invariants.
 

4. Invariants

Invariants are defined as conditions that must be true for your system to operate correctly. This means that you must ensure that every time you change a data structure in the file system, the invariants still hold. For example, consider a BSD-style file system, which has a superblock, a bitmap, an inode table, and data blocks that can contain directories, indirect blocks, etc. A reasonable set of invariants for this system is:
  1. Every block in the file system is used for exactly one purpose
  2. All referenced blocks are properly initialized with data appropriate for their type
  3. All referenced data blocks are marked as used in the bit-map
  4. All unreferenced data blocks are marked as free in the bit-map
We could establish more, but the important point is that knowing the invariants is a critical part of designing a file system; consider them carefully to ensure that they are indeed the right set. Invariants can be considered analogous to laws, and as with laws, their practical significance lies in the penalties that occur when they are violated. As an analogy, compare an invariant to a speed limit that prohibits cars from traveling faster than, say, 65 miles per hour. Here is what happens if you violate each of the above invariants:
  1. Disaster. The entire file system becomes corrupted. A penalty of this severity is akin to saying that if you go 66 miles an hour, your car explodes. This is effective enforcement.
  2. Disaster. The file system is again corrupted. Your car will also explode in this situation.
  3. Disaster. When you next allocate data, you can end up with two files that each think they own the same data block, which again leads to corruption.
  4. A storage leak. Not a disaster; a speeding ticket rather than instant death.
If you are thinking about simplifying the code, you should look at the invariants and see which is least critical/most flexible. In this case, you could relax invariant 4 to, for example:
 
    4.  All unreferenced data blocks are marked free OR are about to be marked free.
 
The point is, we need to carefully consider what invariants to enforce and what could be the consequence of violating them.
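Invariants 3 and 4 are precisely what a consistency checker in the style of fsck verifies. Here is a toy version, with an invented block count and in-memory arrays standing in for the bitmap and the set of referenced blocks; it is a sketch, not a real checker.

```c
#include <stdbool.h>

#define NBLOCKS 16   /* invented size, for illustration only */

/* Returns 0 if consistent, or the number (3 or 4) of the violated
   invariant from the list above.                                  */
static int check_bitmap(const bool used[NBLOCKS],
                        const bool referenced[NBLOCKS]) {
    for (int b = 0; b < NBLOCKS; b++) {
        if (referenced[b] && !used[b]) return 3; /* double allocation waiting */
        if (used[b] && !referenced[b]) return 4; /* leaked block              */
    }
    return 0;
}

/* demos: a consistent disk, and one that leaks block 5 */
static int demo_ok(void) {
    bool used[NBLOCKS] = { false }, ref[NBLOCKS] = { false };
    used[5] = ref[5] = true;
    return check_bitmap(used, ref);
}
static int demo_leak(void) {
    bool used[NBLOCKS] = { false }, ref[NBLOCKS] = { false };
    used[5] = true;   /* marked used, but nothing references it */
    return check_bitmap(used, ref);
}
```

Note the asymmetry: a violation of invariant 3 must halt everything, while a violation of invariant 4 could merely be logged and repaired later.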
 

 
5. Additional Ideas for Robust File Systems
 
In addition to the above topics, there are several more concepts implemented to ensure the robustness of a file system. For example, a complex command could require several atomic commands to execute, but we want to make sure that, from the user's perspective, the entire command appears atomic. That is, if a command consisting of several atomic sub-commands fails in the middle of execution, it should appear to the user that either all of the sub-commands executed or none of them did. To implement such a system, we can use something called a commit record.
 
Commit Record:

A commit record is a separate record used to mark whether an action that we want to appear atomic to the user has actually been done. The way we do this is with an extra write. Say there are four writes needed for the completion of a command. At the end, we add one additional write to the commit record; if that final commit write did not occur, the previous actions do not count. The following is the general pattern for how to do commits:
  1. BEGIN
  2. Pre-commit phase: do the preparatory writes (we can still back out at this point)
  3. COMMIT: one atomic write that makes the whole operation take effect
  4. Post-commit phase: cleanup and copying, typically for performance
  5. END

From the user's point of view, the command does not appear to change the disk unless the commit occurs, which means that it appears atomic. Note that since you haven't really committed any changes until the COMMIT phase, you can change your mind: you can allow aborts during the pre-commit phase, and if an abort occurs, the system must clean up the changes made so far. Using a previous example, suppose we attempt the following using a commit record and find that "c" is out of disk space:
 
rename("a/b", "c/b")
 
Even if many changes are made before it is discovered that "c" is full, the presence of a commit record will allow you to abort and undo those changes, preserving higher-level atomicity. To implement a commit record efficiently in our file system, we may also use journaling.
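The commit-record discipline can be simulated in a few lines. All names here are invented, and the single boolean assignment stands in for the one atomic sector write of the commit record.

```c
#include <stdbool.h>

struct fsim { char shadow[4]; char data[4]; bool committed; int nshadow; };

/* pre-commit phase: stage the new bytes somewhere harmless */
static void precommit(struct fsim *fs, const char *newdata, int n) {
    for (int i = 0; i < n; i++) fs->shadow[i] = newdata[i];
    fs->nshadow = n;
}

/* COMMIT: one atomic write flips the whole update on */
static void commit(struct fsim *fs) { fs->committed = true; }

/* post-commit phase: copy staged bytes into place (idempotent) */
static void postcommit(struct fsim *fs) {
    for (int i = 0; i < fs->nshadow; i++) fs->data[i] = fs->shadow[i];
}

/* recovery: redo the post-commit copy only if the commit record exists */
static void recover_fsim(struct fsim *fs) {
    if (fs->committed) postcommit(fs);
}

/* demo: crash after staging, optionally before the commit write,
   then recover; return the first visible data byte                */
static char demo_commit(bool crashed_before_commit) {
    struct fsim fs = { { 0 }, { 'X', 'X', 'X', 'X' }, false, 0 };
    precommit(&fs, "YYYY", 4);
    if (!crashed_before_commit) commit(&fs);
    recover_fsim(&fs);          /* reboot happens here */
    return fs.data[0];
}
```

A crash anywhere before the commit leaves the old data visible; a crash anywhere after it is repaired by redoing the copy.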
 
Journaling:
 
We are trying to figure out how to make updates to the file system efficiently, and we have a scheme involving commit records in which we are very careful about write order. One proposed method is a journal-oriented file system that is append-only. That is, the file system is an array data structure that is constantly growing: the only way to modify the file system or change a file is to append a new block, and the current state is found by analyzing the journal from left to right. As described, this is an obviously impractical system that wastes storage immensely, but if your I/O is mostly writes, it can be efficient, since there will be no seeks; you will always be writing at the end of the journal.
 
Instead, let us employ this idea of an append-only journal and use it in conjunction with an ordinary file system. When you want to make changes to the file system, you log the changes first into the journal. At the end of the sequence of commands, you put a commit record. Later, you can copy the data into the file system proper and mark the entry as "done" in the journal. If the system crashes between the journal commit and storing the changes to the actual disk, then upon reboot you can examine the journal, find the committed actions, simply do what you were planning to do anyway, and then reclaim the journal segments that are done.
 
The important thing about recovering a previous command is that the recovery must be idempotent, because the recovery process can crash too and will then be re-run. Additionally, if a journal sequence ends in an abort, the preceding actions are ones that the system must NOT execute.
 
What's described above is a "write-ahead" log: you log what you are about to do, commit, then copy the new state into the disk. Another method is "write-behind", in which you log the old disk contents, write the new disk contents, then commit. The idea here is that the log contains the old version of the data rather than the new one; as a result, if there is no commit, the old version is recoverable. Both approaches work.
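The write-ahead variant can be sketched as follows. This is a toy model under stated assumptions: each journal entry records the new contents of one block, the commit record is a boolean, and replay is a pure overwrite, which is what makes it idempotent. All names are invented.

```c
#include <stdbool.h>

#define JMAX 8

struct jentry  { int block; char newval; };
struct journal { struct jentry e[JMAX]; int n; bool committed; };

/* Recovery: ignore the log entirely unless the commit record was
   written; otherwise re-apply every logged write. Re-running this any
   number of times yields the same disk, so recovery itself may crash
   and be restarted safely.                                            */
static void replay(const struct journal *j, char disk[]) {
    if (!j->committed) return;
    for (int i = 0; i < j->n; i++)
        disk[j->e[i].block] = j->e[i].newval;
}

/* demo: one logged write of 'B' to block 2, replayed `times` times */
static char demo_replay(bool committed, int times) {
    char disk[4] = { 'A', 'A', 'A', 'A' };
    struct journal j = { { { 2, 'B' } }, 1, committed };
    for (int t = 0; t < times; t++) replay(&j, disk);
    return disk[2];
}
```

Replaying once or three times gives the same result, and an uncommitted log changes nothing, which is exactly the behavior the recovery procedure needs.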