By Hongxiang Gu and Libo Wang
A system is built from a number of components, and these lower-level subsystems may or may not be reliable. We have two ways to make the system reliable at our level:
There are many events that can crash a system, and they can be categorized into external events and internal events. For a very simple desktop system (with a single disk, CPU, RAM, network interface, etc.), the bad events are:
Imagine we have a simple file "cs111grade.txt". For simplicity, its size is always 20,000 bytes.
read(fd, buf, 20000);   // LOAD operation: easy to implement from the user's viewpoint
write(fd, buf, 20000);  // SAVE operation
And the file's inode data structure looks like this:
This call to write requires us to update both the metadata and the data.
And this leads to the Golden Rule of Atomicity:
The Golden Rule of Atomicity: never overwrite your only copy, unless your underlying write operation is known to be atomic.
Clearly, if we keep a copy of the original file, we can fix the inconsistency problem in writes. This method keeps one extra copy of the file we are writing, and nothing else. Here's how it works. Call the old version A and the new version B. The modification is made in a file called TMP, and then TMP is copied over the main file MAIN. The high level view of the whole process is:
| Phase | 1 | 2 | 3 |
|-------|---|---|---|
| MAIN  | A | A | B |
| TMP   | ? | B | B |
However, this approach doesn't work. The low level implementation of this write is:
| Phase | 1 | 2 | 3 | 4 | 5 |
|-------|---|---|---|---|---|
| MAIN  | A | A | A | ? | B |
| TMP   | ? | ? | B | B | B |
Notice that in the first four phases the two files differ. After a power failure and reboot, if we find that MAIN differs from TMP, we cannot tell which phase we were in (MAIN may be garbage, or TMP may be garbage), so reliable recovery is still impossible.
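For concreteness, here is a minimal C sketch of this naive save (not from the original notes; error handling is omitted, and MAIN is simply rewritten from the same buffer, which has the same effect as copying TMP over it):

```c
#include <fcntl.h>
#include <unistd.h>

#define FILESIZE 20000

/* Naive two-copy save: write the new contents into TMP, then overwrite MAIN.
   Neither write() is atomic, so a crash can leave either file as garbage ('?'). */
void naive_save(const char *buf)
{
    int fd = open("TMP", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, buf, FILESIZE);    /* phases 2-3: TMP goes from ? to B */
    close(fd);

    fd = open("MAIN", O_WRONLY | O_TRUNC);
    write(fd, buf, FILESIZE);    /* phases 4-5: MAIN goes from A through ? to B */
    close(fd);
}
```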
A first fix is to keep three copies of the file. Here is how this method works: the three files are called MAIN, TMP1, and TMP2. The user makes modifications in TMP1, and MAIN is later updated with the aid of TMP2.
| Phase | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|-------|---|---|---|---|---|---|---|
| MAIN  | A | A | A | A | A | ? | B |
| TMP1  | A | ? | B | B | B | B | B |
| TMP2  | A | A | A | ? | B | B | B |
The recovery procedure after reboot is "majority rules": when two copies are identical, take that majority value. Otherwise (phase 4, where no two copies agree), take TMP1. This guarantees we recover either A or B, never "?". All-or-nothing atomicity is achieved.
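A minimal C sketch of this recovery step (files_equal and copy_file are assumed helpers for comparing and copying whole files; they are not from the original notes):

```c
#include <stdbool.h>

bool files_equal(const char *a, const char *b);   /* assumed helper: compare contents */
void copy_file(const char *src, const char *dst); /* assumed helper: copy whole file  */

/* "Majority rules" recovery for the three-copy scheme, run after reboot. */
void recover_three_copies(void)
{
    const char *winner;

    if (files_equal("MAIN", "TMP1"))
        winner = "MAIN";          /* phases 1 and 7 */
    else if (files_equal("TMP1", "TMP2"))
        winner = "TMP1";          /* phases 5 and 6 */
    else if (files_equal("MAIN", "TMP2"))
        winner = "MAIN";          /* phases 2 and 3 */
    else
        winner = "TMP1";          /* phase 4: no two copies agree, take TMP1 */

    /* Re-establish agreement so the next save starts from a clean state. */
    copy_file(winner, "MAIN");
    copy_file(winner, "TMP1");
    copy_file(winner, "TMP2");
}
```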
The downside: this procedure is not ideal for large files, since keeping 3 copies of a large file makes every save roughly 3 times slower and triples the storage cost.
A cheaper approach is to keep a COMMIT record of 512 bytes (the size of a disk sector), under the assumption that writing a single sector is atomic. The COMMIT record holds only a 0 or a 1. The phase table then looks like this:
| Phase  | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|
| MAIN   | A | A | A | A | ? | B | B |
| TMP    | A | ? | B | B | B | B | B |
| COMMIT | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
If COMMIT==0, then MAIN contains the consistent data; if COMMIT==1, then TMP contains the consistent data. We can see that the actual commit action happens in phase 4. Phase 7 is optional: by then it no longer matters whether we distinguish MAIN from TMP, since COMMIT tells us which file to look at.
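Here is a minimal C-style sketch of a save and recovery under this scheme (not from the original notes; write_file, copy_file, read_commit, and write_commit are assumed helpers, and only write_commit's single-sector write is assumed atomic):

```c
void write_file(const char *name, const char *buf, int n); /* assumed helper */
void copy_file(const char *src, const char *dst);          /* assumed helper */
int  read_commit(void);                                    /* assumed: returns 0 or 1 */
void write_commit(int v);                                  /* assumed: one atomic sector write */

void save(const char *buf)
{
    write_file("TMP", buf, 20000);  /* phases 2-3: TMP goes from A through ? to B */
    write_commit(1);                /* phase 4: the commit point (atomic) */
    copy_file("TMP", "MAIN");       /* phases 5-6: MAIN goes through ? to B */
    write_commit(0);                /* phase 7: optional; the next save reads MAIN again */
}

void recover(void)
{
    if (read_commit() == 0) {
        copy_file("MAIN", "TMP");   /* MAIN holds the consistent data */
    } else {
        copy_file("TMP", "MAIN");   /* TMP holds the consistent data; finish the job */
        write_commit(0);
    }
}
```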
Here we assumed that the COMMIT write (a small, single-sector write) is atomic. This is part of what are known as the "Lampson-Sturgis assumptions", a useful set of assumptions when designing a file system:
Lampson-Sturgis assumptions
This approach is an application of the general pattern for implementing a high level all-or-nothing operation (the file write) atop a system in which only some lower level operations are atomic (single-sector writes):
Algorithm atomic-do:
    BEGIN (conceptual)
    Pre-commit actions (not permanent: can back out, leaving no trace)              // phases 1-3
    COMMIT action (atomic at the lower level; makes the higher level action happen) // phase 4
    Post-commit actions (we can't change our mind here, but may improve performance // phases 5-7
        by setting things up for the next operation, etc.)
    DONE (conceptual)
A more general way to get all-or-nothing updates is to keep a journal (log) alongside the regular "cell" storage. The idea behind this family of methods is taken from double-entry bookkeeping: never erase anything from your ledger. The journal storage is append-only, and the journal and the cell storage do not have to live on the same drive. The cell storage can also be cached in RAM to improve performance. Here's a picture showing a file system divided into two parts: journal and cell storage.
Say we want to change block 27's content from A to B. Here is one simple logging protocol (the write-ahead log):
1. Append the new value B for block 27 to the journal.
2. Wait until the journal entry is safely on disk (the commit point).
3. Write B into block 27 of the cell storage.
Recovery procedure after power failure: scan the journal from the oldest entry to the newest, and re-install every committed new value into the cell storage.
Here is an alternate logging protocol (the write-behind log):
1. Read the old value A from block 27 and append it to the journal.
2. Write the new value B into block 27 of the cell storage.
3. Append a commit record for this operation to the journal.
Recovery procedure after power failure: scan the journal backward from the newest entry; for every operation that has no commit record, write the saved old value back into the cell storage.
Protocol 2 recovers faster because recovery starts reading from the newest journal entries and can begin repairing damage immediately, which is especially good for long journals. The downside is that each individual operation is slower, since we must read and log the old value every time, even though recovery is rare.
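To make the two protocols concrete, here is a minimal C-style sketch of both, for changing a cell-storage block from A to B (log_append, log_commit, cell_read, cell_write, and BLOCK_SIZE are assumptions, not part of the original notes):

```c
#define BLOCK_SIZE 8192  /* assumed block size */

void log_append(int blockno, const char *value);  /* assumed: append an entry to the journal */
void log_commit(void);                            /* assumed: atomically mark the operation committed */
void cell_read(int blockno, char *buf);           /* assumed: read a block of cell storage */
void cell_write(int blockno, const char *buf);    /* assumed: write a block of cell storage */

/* Protocol 1: write-ahead log.
   Log the new value, commit, then install it into cell storage.
   Recovery replays the journal oldest-to-newest, re-installing committed values. */
void wal_update(int blockno, const char *newval)
{
    log_append(blockno, newval);   /* journal now remembers B */
    log_commit();                  /* commit point */
    cell_write(blockno, newval);   /* install B into cell storage */
}

/* Protocol 2: write-behind log.
   Save the old value, install the new value, then commit.
   Recovery scans newest-to-oldest and restores old values of uncommitted operations. */
void wbl_update(int blockno, const char *newval)
{
    char oldval[BLOCK_SIZE];
    cell_read(blockno, oldval);    /* this extra read and log is the per-operation cost */
    log_append(blockno, oldval);   /* journal remembers A */
    cell_write(blockno, newval);   /* install B into cell storage */
    log_commit();                  /* commit point */
}
```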
This picture shows the inode and data blocks of a directory file. (Regular files are organized in a similar way, but their data blocks contain file data rather than directory entries.)
Changing file names also involves writing. If the new filename is in the same directory as the old one, we only have to modify one directory file, and this can be handled with the atomic write techniques described above.
Otherwise, we must write to two directory files. Here is one method, but it is subject to data loss, which is considered a disastrous outcome. Suppose the high level operation is:
rename("d/f","e/g"); // change the file name from "d/f" to "e/g"
Here's the algorithm:
Algorithm RENAME_1
 1  get d's inode into RAM
 2  get d's data
 3  get e's inode into RAM
 4  get e's data
 5  update d's data in RAM
 6  update e's data in RAM
 7  write d's data (f is officially removed)
 8  write e's data (g is officially created)
However, if the power goes down right after step 7, before step 8 creates g, then the file is missing after reboot! We must swap these two steps:
 7  write e's data (g is officially created)
 8  write d's data (f is officially removed)
Now if the power goes down after step 7, both f and g exist in the file system after reboot. However, the file's link count is still 1: after the user removes either name, the count drops to 0 and the inode is freed, yet the other name still points to it, and opening it leads to disaster. Therefore we must also update the file's link count. The following method prevents these situations:
Algorithm RENAME_2
 1  get f's inode into RAM
 2  get d's inode into RAM
 3  get d's data
 4  get e's inode into RAM
 5  get e's data
 6  update d's data in RAM
 7  update e's data in RAM
 8  change f's link count to 2
 9  write e's data (g is officially created)
10  write d's data (f is officially removed)
11  change f(g)'s link count to 1
We can see that before step 8, everything is in RAM and nothing is committed. If the system crashes between steps 9 and 10, we are left with two hard links to one file and a link count of 2, which is valid. If the system crashes between steps 8 and 9, or between steps 10 and 11, we are left with a file whose link count is 2 but with only one hard link pointing to it. If a user later removes that file, the inode ends up with link count 1 but no directory entry pointing to it: a storage leak! However, compared to data loss, a storage leak is not a very serious problem, and it can be fixed with a file system checker program (fsck). See lecture 14 for details.
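As a rough sketch of how a checker could detect such a leak (this is an illustration with assumed helpers, not the actual fsck algorithm), it can compare each inode's stored link count with the number of directory entries that actually reference the inode:

```c
#include <stdio.h>

#define NINODES 65536             /* assumed number of inodes */

int inode_link_count(int ino);    /* assumed: link count stored in the inode */
int count_dir_entries(int ino);   /* assumed: scan all directories for entries pointing to ino */

void check_links(void)
{
    for (int ino = 1; ino < NINODES; ino++) {
        int stored = inode_link_count(ino);
        int actual = count_dir_entries(ino);
        if (stored > 0 && actual == 0)
            printf("inode %d: leaked (link count %d, no directory entries)\n",
                   ino, stored);
        else if (stored != actual)
            printf("inode %d: link count %d should be %d\n",
                   ino, stored, actual);
    }
}
```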