CS111 11/13/08

File System Robustness

Scribe notes by
Yajie Wang
Dean Ushijima

IEEE standard terminology for Robustness:

Error: mistake made by designer/programmar
Fault: defect in hardware or software which may cause failure
Failure: when the system behaves incorrectly

It is important to note that errors don't always lead to faults and failures (especially if they're consistent)

File system goals relative to failures:

Durability: file system should survive hardware failures (this goal is rather limited - file system can't be preserved if the computer is utterly destroyed, etc) (e.g. power failures)
Atomicity: changes to the file system are either made or not made
Performance: we want low latency and high throughput (keep the system as busy as possible

Rename example

Consider the following example: rename("a", "b")
Steps:
1. Read directory page into RAM
1b. If "b" exists, go to step 98
2. Change "a" to "b" in RAM
3. Write out modified page

Where can crashes occur? At steps 1 and 2, crashes are okay. But if a crash occurred at step 3, bad things can happen. Of course, if page writes are atomic, then a crash would not result in disaster. Other bad things can happen when a separate process is allowed to do steps 2 and 3 (in the case of optimizing aggressively so that what is on disk is not the same from the application's point of view).

Harder rename example

Now consider this example: rename("d/a", "e/b")
Steps:
1. Read d's data page into RAM
2. Read e's data page into RAM
2a. Make sure b does not exist, otherwise go to step 99
3. Alter d's page in RAM to clear entry for a / add b to RAM
4. Write d's page
5. Add b to e's page
6. Write e's page
... and eventually increment last d and e's mod time

It would be very bad if a power failure occurred somewhere in steps 4-6. If a crash occurred between steps 4 and 5, the file would be lost. To fix this problem, we should probably switch #3,4 with #5,6. However, a crash now could lead to two duplicate files with a link count of 1. Thus, we need to increment link count so that the crash would accurately create two duplicate files with link count of 2. Then after a reboot, we can do an unlink and fix the problem.

We should make the following revisions to the algorithm:
Steps:
1. Read d's data page into RAM
2. Read e's data page into RAM
2a. Make sure b does not exist, otherwise go to step 99
3. Add b to e's page
3a. Increment file's link count on disk
4. Write e's page
5. Alter d's page in RAM to clear entry for a / add b to RAM
6. Write d's page
6a. Decrement link count
... and eventually increment last d and e's mod time

Standard solution for a fast file system style file system

If this is true before, it would be true afterwards as well

Consequence File System Correctness Invariants

Disaster!                1. Every block is used for exactly one purpose (boot/superblock/bitmap/node/data)
Disaster!                2. All blocks reachable from the superblock are initialized to valid values.
Disaster!                3. All reference (reachable) blocks of data are marked 'used' in the bitmap.
Disk Leak                4. All unreachable blocks are marked 'unused'.

Disk Leak can be fixed by fsck() after reboot.

Interaction between performance + robustness

Suppose we want to do alot of writes. Usually the OS will optimize the writes but would this undo compromise the robustness of the system? Is some sort of reordering possible?

Recording write requests is common for performance

Done within disk drive + controller
This messes up careful ordering of writes (our rename example)

Fix?

Disable all performance improvements
Force ordering only if out of order execution causes Disaster!
- Writes to metadata ordered
- Writes to ordinary data reordered but not too far <--tricky

What if block writes are not atomic?

On real disks, typically:

Sector writes are atomic (512 bytes)
block/page writes are not atomic

On other devices (in general):

block contents are generally scrambled

Here are some possible methods to do writes: 2 pages to implement 1 page (Time is on vertical access doing down)

with single copy of data

X	X
?	X
Y	X
Y	?
Y	Y

with two copies of data (can crash and still mess up after step 2)

a	b	c
X	X	X
?	X	X
Y	X	X
Y	?	X
Y	Y	X
Y	Y	?
Y	Y	Y

Fix: have 3 copies. When reading, read all 3. If they are all equal, you're done. If values disagree, overwrite the minority with a == b ? a : c

Lampson-Sturgis' assumptions

Storage writes may fail or may corrupt another piece of storage.
a. Assume a later read can detect bad sectors
Storage can spontaneously decay
a. (1a) Assume a later read can detect bad sectors
Errors are rare
Assume you have enough time to repair errors

Implement atomicity above a lower level system with atomic page writes

How to build a File System

Copy new 10 blocks to copy
Copy new 10 blocks from copy to actual

On reboot, copy File System as actual File System. Ideas on how to build file systems robustly:

Idea #1: Commit Record

Write new blocks to copy disk
Wait for blocks to hit copy disk
write commit record saying "done"
Copy new blocks from copy FS to actual FS
Wait for blocks to hit actual disk
Clear commit record

Looks like 1 big atomic operation!

BEGIN
pre-commit phase (can back out without effect)
COMMIT (atomic)
post-commit phase (affects performance, not behavior)
END

Idea #2: Journaling

double entry bookkeeping (never erased) 2 copies of everything

Don't copy whole File System
Just ouput a record of the changes you intend to make

Basic Protocol

Log Update before installing it in cell memory

Write changed blocks to log
Write COMMIT record
Install changed blocks to cell memory
Clear commit record

Run a recovery procedure after crashes to replay journal

Scan log in forward order
For every commit record, do steps 1.3 and 1.4

(In practice, running out of journal space doesn't happen very much!)

Recovery

It must work even if crash is in recovery. The recovery must be "idempotent" and cell storage can be in RAM
There are two types of recovery:

Scan forward through log using Write-ahead log
- Log actions
- Commit
- Install new values into cell data
Recovery:
- Scan forward through log
- roll-forward recovery
Scan backward through log using Write-behind log
- Log old values
- Install new values
- Commit
Recovery:
- Scan backward through log
- Roll back recovery

Performance problem

Save() takes a long time for data to hit disk. As a solution, many systems optimize-->

save() returns
disk writes are scheduled, but not done

One workaround: delay "external" actions/requests (such as graphics) until save() is actually done
another way: a later action might unexpectedly fail. Then we would have cascading aborts and compensating actions to handle the disk writes.

Look forward to more on disk failures in the next lecture!