CS111 Scribe Notes · Lecture 15 · Ryan Dall & David Peatman

What can go wrong with hardware?

There are many things that can go wrong with any given piece of hardware. Below is a subset of problems, all of which could cause significant problems with the operation of a system.

Power loss
Running out of space
Running under a heavy load
Sudden disconnects
Bad sectors/tracks
Disk removed
Explosions

Some of these problems will lead to much more significant system damage than others. However, for this lecture we choose to focus on power loss as it is the simplest catastrophic failure possible.

Examining power loss

One common solution to prevent such a catastrophic loss would be to buy a cheap external power source. Power would be provided for long enough for all processes to safely exit and shutdown to occur. For the purposes of this discussion we will assume that this is not the case. In this scenario, the loss of power would be near instantaneous from the perspective of a user. However, the processor will still have a handful of microseconds in which it can perform actions. This time is normally used to take actions to prevent permanent damage to the system, such as pulling the disk arms off of the drive.

But what about other processes that are occurring at the same time as the failure? We'd like to assume actions that could change memory, such as a write, are atomic. However, this is not always true

Case Study: Zhang et. al. - Fast 13: Understanding Robustness of SSDs under Power Fault

In this recent study, 15 normal consumer solid state drives were tested as to what would happen when subjected to a power failure in the middle of operation. As a result of this testing, only 2 of the drives were completely functional in an error-free state. The remaining 13 drives exhibited one or more of the following problems:

Bit corruption
Flying writes
- Sector written in wrong location due to wear levelling
Shorn writes
- Write only partially complete
Meta-data corruption
Dead device
Unserializability
- Random pieces of the write are completed out of order
  - write("A B C D E F") gives "A C D F"

The moral of the story is that you should always test your flash memory prior to using it for anything sensitive. In the effort to push performance to the limit for these drives, flaws with robustness have been introduced. These flaws then cause catastrophic failures when faced with abnormal circumstances.

How can we solve this these types of problems with software?

Suppose we're trying to implement the ^x^s functionality of Emacs. To do so we need to overwrite the old version of the file in memory (version 0) with our new updated copy (version 1). The simple answer would be to call write and go about our business. But what happens if power crashes during this write? If we have no OS to defend against this problem, massive amount of data can be lost during this write, or necessary data can be overwritten with complete garbage. This prompts the Golden Rule of Atomicity: never overwrite your only copy.

How could we solve this data loss? By introducing a second device! The file can be written to device A while the original copy exists on device B, and then after the write completes the file can be copied to B. This prevents any loss of data as a good copy of the file must always exist! But consider what would happen in this scenario if power failed at one of the internal steps:

Device	Step 1	Step 2	Step 3	Step 4	Step 5
A	0	?	1	1	1
B	0	0	0	?	1

The states of each device are correct at steps 1 and 5 and a good copy still exists if one device contains garbage in steps 2 and 4. However, a power fault which arrives at step 3 will lead to confusion as to which device is right. You have both an old and new copy with no way for the system to realize which is correct on reboot. What can we do? Introduce a 3rd device!

Device	Step 1	Step 2	Step 3	Step 4	Step 5	Step 6
A	?	1	1	1	1	1
B	0	0	?	1	1	1
C	0	0	0	0	?	1

This would solve the problem of bit corruption and shorn writes, plus we always know which version is correct! If power fails at any point, the correct version is the one that exists on the majority of the devices. If all three disagree, then A is correct. But this system is horrible, no one will use it on a production system! Why would anyone want to maintain three sets of memory? To truly solve these problems we must improve our model.

The Lampson-Sturgis assumptions

For reasons outside the scope of this lecture, the following assumptions can be used in order to design a system which would prevent all 6 problems demonstrated in the case study above.

Storage writes may fail or corrupt another piece of storage
A later read can detect bad sectors
Storage may spontaneously decay
Errors are rare
Repairs can be performed prior to subsequent failures

However, as this explanation is tricky, the remainder of this lecture will be performed with the added simplicity from The Eggert Assumption: all block writes are atomic

Problems with the Eggert Assumption

This assumption will simplify things greatly. However, it is not perfect by any stretch. Consider a BSD file system where file meta-data is stored in inodes, apart from the data contained in the file.

For any high-level write performed, two changes need to be made to memory; the meta-data in the inode must be updated, and the actual data must be overwritten. As the inode block is seperated from the data blocks by a significant distance, what happens if power is lost as the arm is tracking between the two? Depending on the order of the writes...

Either the data has been written and the meta-data hasn't
- Data could be changed without reflecting such a change in things such as the time stamp
Or the meta-data is written and the data isn't correct
- This would cause problems such as the time stamp indicating a change, while the data remains the same

In addition, a common optimization to this process for large or multiple writes is to dally updating the inode until all the data blocks have been written in order to prevent the disk arm from seeking an excessive amount of the time. If power is lost in the middle of this, the data of multiple files could be incorrect at reboot! Furthermore, consider the procedure of renaming a file with a command such as rename("d/a", "d/b")

Find d's inode
Find d's data
Find the entry for a
change it to b
write it to disk

This is a simple process that only involves overwriting directory d's "a" entry with "b". However, renaming to something longer than one character could cause new block allocation and writes with more potential for data corruption on power loss in a similar manner to the previous example. Now consider the procedure renaming a file and directory (rename("d/a" "e/b")). Depending on the order that the names are overwritten..

Change directory name then file name
- If power fails after the directory is renamed, the file becomes unreachable - complete data loss
Change file name then directory name
- If power fails after the file is renamed, a and b both reference the same file (like hard links) - much less serious

Taken all together, these possible problems require the implementation of invariants in addition to the Eggert Assumption for correct behaviour

Invariants of a File System

For a given file system, the invariants of the system (basic truths) have to be as follows

Every block of the file system is used for only one purpose
All blocks are initialized to appropriate data types
All utilized blocks are marked as used in the bitmap
All un-utilized blocks are marked as free in the bitmap

The consequences of violating the corresponding invariant above are:

Whole system can be corrupted (Disaster)
Random things are pointed to (Disaster)
Overwrite necessary information (Disaster)
Storage leaks (Bad, but not terrible)

It follows from these lists that the consequences of violating some invariants are much worse than others. However, for normal functionality all must be observed completely

Commit Records

The goal of commit records is to perform non-atomic functions (writes) in a manner which would appear to atomic to a high-level user. Consider the following procedure:

pre-commit phase - Work is being done in temp files, always can be reversed
commit or abort - atomically write from temp files to actual location or delete temp files
post-commit phase - work is done, clean up the workspace.

From the user perspective, the write is now atomic. If a job is cancelled or power is lost prior to the commit, the state of the intended destination is completely unchanged and all previous work can simply be forgotten and cleaned by function call. After the commit, all work done does not alter the state of the destination file and the function is therefore, by definition, atomic. This gives us a way to implement something which satisfies the Eggert Assumption

Journalling

The general idea of journalling is to create a record of each change that is made to the system. Therefore, when recovering from something such as power loss all current changes can be recreated or reverted. This can be implemented by the creation of two data structures: the "cell block" where each cell represents the changes made to a memory block and the journal where changes are logged. By following this procedure...

Append a begin entry to the journal and mark changes to be made in reference to cell block
Append a commit entry to the journal and write the changes to disk
Append a complete entry to journal once write completes

recovering from a reboot becomes as simple as reading through the journal from left to right and verifying whether the changes have been completed or not.

Hybrid System

With journalling, writes are fast but reading is slow as there is constant loading of cell blocks from memory. A solution to this would be to load the cell block into RAM, but as it is implemented it is much to large. Therefore, only commonly used cells are loaded into RAM at any given time. This turns out to be a very effective system and, combined with atomic writes, solves the six enumerated problems!

This idea is implemented in production systems following one of two protocols. Write-ahead protocol logs all planned changed and then marks them done after their implementation is complete. This log may then be read from left to right in order to recover upon a reboot. In contrast, Write-behind protocol logs old data that is about to be overwritten and then logs a complete event after the write finishes. This results in a log which is read from right to left on reboot. Of the two, write-behind tends to be the faster in recovering a lost state as it does not need to verify as many entries as write-ahead.