CS111 11/13/08

File System Robustness

Scribe notes by
Yajie Wang
Dean Ushijima

IEEE standard terminology for robustness:

  - Error: a mistake in the designer's head (e.g., a misunderstood spec).
  - Fault: a defect in the system (hardware or software) caused by an error; a fault is latent until it is triggered.
  - Failure: observable incorrect behavior, caused by a triggered fault.

It is important to note that errors don't always lead to faults, and faults don't always lead to failures (a latent fault may never be triggered, especially if the system behaves consistently in spite of it).

File system goals relative to failures: durability (data survives limited hardware failures) and atomicity (each operation either happens completely or not at all).

Rename example

Consider the following example: rename("a", "b")
Steps:
1. Read the directory page into RAM
1b. If "b" already exists, go to step 98 (error exit)
2. Change "a" to "b" in the RAM copy of the page
3. Write out the modified page

Where can crashes occur? Crashes at steps 1 and 2 are harmless: the on-disk directory has not changed yet. A crash during step 3 is the dangerous case, because an interrupted write can leave the directory page corrupted. If page writes are atomic, even a step-3 crash is safe: the on-disk page is always either entirely old or entirely new. Other bad things can happen if a separate process is allowed to interleave with steps 2 and 3, or if the OS optimizes so aggressively that what is on disk no longer matches the application's view of the data.
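The crash windows in the single-page rename can be sketched as a toy simulation (illustrative only, not real file-system code; the inode name is made up):

```python
# Toy model of rename("a", "b") on a single directory page.
# A dict stands in for the on-disk directory page.

def rename_steps(disk_page):
    """Yield the on-disk page contents after each step completes."""
    ram = dict(disk_page)            # step 1: read page into RAM
    yield dict(disk_page)            # crash here: disk unchanged
    if "b" in ram:
        raise FileExistsError("b")   # step 1b: error exit
    ram["b"] = ram.pop("a")          # step 2: change "a" to "b" in RAM
    yield dict(disk_page)            # crash here: disk still unchanged
    disk_page.clear()                # step 3: write out the modified page
    disk_page.update(ram)            # (safe only if this write is atomic)
    yield dict(disk_page)

disk = {"a": "inode-12"}
states = list(rename_steps(disk))
# After steps 1 and 2 the disk still holds the old directory;
# only step 3 changes it.
```

The simulation shows why only step 3 matters for robustness: every earlier crash leaves the old, valid directory on disk.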

Harder rename example

Now consider this example: rename("d/a", "e/b")
Steps:
1. Read d's data page into RAM
2. Read e's data page into RAM
2a. Make sure b does not exist, otherwise go to step 99 (error exit)
3. Alter d's page in RAM to clear the entry for a
4. Write d's page
5. Add b to e's page in RAM
6. Write e's page
... and eventually update d's and e's modification times

It would be very bad if a power failure occurred somewhere in steps 4-6. If a crash occurred between steps 4 and 6, the file would be lost: a has been removed from d, but b has not yet been added to e, so no directory entry points at the file's inode. To fix this, we could swap the order, doing steps 5-6 (add b to e) before steps 3-4 (remove a from d). However, a crash in that window leaves the file with two directory entries while its link count is still 1, which is dangerous: unlinking either name would free the inode while the other name still refers to it. Thus, we need to increment the link count first, so that during the window the file genuinely has two names and a link count of 2. Then after a reboot, recovery can do an unlink of the extra entry and fix the problem.

We should make the following revisions to the algorithm:
Steps:
1. Read d's data page into RAM
2. Read e's data page into RAM
2a. Make sure b does not exist, otherwise go to step 99 (error exit)
3. Add b to e's page in RAM
3a. Increment the file's link count on disk
4. Write e's page
5. Alter d's page in RAM to clear the entry for a
6. Write d's page
6a. Decrement the file's link count on disk (back to 1)

... and eventually update d's and e's modification times
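The key property of the revised ordering can be checked with a toy crash simulation (made-up data structures, not real FS code): at every crash point, the on-disk link count is never less than the number of directory entries naming the file. Too high is a recoverable leak; too low would be a disaster.

```python
# Toy crash simulation of the revised rename("d/a", "e/b").
# Each element of the returned list is the on-disk state
# (d's page, e's page, link count) at one possible crash point.

def revised_rename_states():
    d, e, link_count = {"a": True}, {}, 1
    states = [(dict(d), dict(e), link_count)]
    link_count = 2                      # 3a: increment link count on disk
    states.append((dict(d), dict(e), link_count))
    e["b"] = True                       # 3-4: write e's page with b added
    states.append((dict(d), dict(e), link_count))
    del d["a"]                          # 5-6: write d's page with a removed
    states.append((dict(d), dict(e), link_count))
    link_count = 1                      # 6a: decrement link count
    states.append((dict(d), dict(e), link_count))
    return states

for d, e, count in revised_rename_states():
    entries = len(d) + len(e)
    assert count >= entries   # never under-counted at any crash point
```

If the increment were delayed until after step 4, the third state would have two entries but a count of 1, violating the invariant.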

Standard solution for a fast file system style file system

If these invariants hold before an operation, they must hold afterwards as well (including after a crash).

File system correctness invariants (with the consequence of violating each):

1. Every block is used for exactly one purpose (boot / superblock / bitmap / inode / data).  Violation: disaster!
2. All blocks reachable from the superblock are initialized to valid values.  Violation: disaster!
3. All referenced (reachable) blocks of data are marked 'used' in the bitmap.  Violation: disaster!
4. All unreachable blocks are marked 'unused' in the bitmap.  Violation: disk leak

A disk leak wastes space but loses no data; it can be fixed by running fsck after reboot.
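The recoverable case (invariant 4) is what an fsck-style check exploits. A minimal sketch of the two set comparisons involved, with made-up block numbers:

```python
# Sketch of the bitmap consistency checks an fsck-style tool performs.
# 'bitmap_used' = blocks marked used; 'reachable' = blocks actually
# referenced by some file reachable from the superblock.

def find_leaked_blocks(bitmap_used, reachable):
    """Blocks marked used but unreachable: disk leaks (safe to free)."""
    return sorted(bitmap_used - reachable)

def find_disasters(bitmap_used, reachable):
    """Blocks reachable but marked free: invariant 3 violated (disaster)."""
    return sorted(reachable - bitmap_used)

used = {2, 3, 5, 7}
reachable = {2, 3, 5}
assert find_leaked_blocks(used, reachable) == [7]   # block 7 leaked
assert find_disasters(used, reachable) == []        # no disasters
```

Leaked blocks can simply be marked unused again; a nonempty disaster list means data may already have been overwritten.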

Interaction between performance + robustness

Suppose we want to do a lot of writes. Usually the OS will optimize the writes (batching, caching, reordering), but does this optimization compromise the robustness of the system? Is some safe reordering possible?

Reordering write requests is common for performance.

Fix?

What if block writes are not atomic?

On real disks, typically: a write interrupted by a power failure leaves a partially written sector, but the sector's checksum lets a later read detect that it is bad.

On other devices (in general): an interrupted write may leave arbitrary garbage that a later read cannot detect as bad.

Here are some possible methods of doing writes, using 2 physical pages to implement 1 logical page (time runs down the vertical axis; X = old contents, Y = new contents, ? = write in progress, possibly garbage):

X
?
Y

with a single copy of data (a crash during the write leaves "?", and the only copy may be garbage)

copy1 copy2
X     X
?     X
Y     X
Y     ?
Y     Y

with two copies of data (a crash can still mess things up: when the two copies disagree, there is no way to tell which one is the interrupted, garbage copy)

a b c
X X X
? X X
Y X X
Y ? X
Y Y X
Y Y ?
Y Y Y

with three copies of data, written in the order a, b, c

Fix: keep 3 copies. When reading, read all 3. If they are all equal, you're done. If values disagree, take the value (a == b ? a : c) and overwrite the minority copy with it. At every crash point above, this rule yields either the old value or the new value, never garbage.
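The three-copy voting rule can be checked against every state in the timeline above (a toy check; "garbage" stands in for an arbitrary partially written value):

```python
# Three-copy voting read: a == b ? a : c.  At any crash point at most
# one copy is garbage or stale, so the rule always returns a valid
# value (old or new), never garbage.

def read_three_copies(a, b, c):
    return a if a == b else c

X, Y, G = "old", "new", "garbage"
# every state the write timeline (a, then b, then c) can leave behind:
timeline = [(X, X, X), (G, X, X), (Y, X, X),
            (Y, G, X), (Y, Y, X), (Y, Y, G), (Y, Y, Y)]
for state in timeline:
    assert read_three_copies(*state) in (X, Y)
```

Note the asymmetry: when a and b disagree, one of them is the interrupted copy, so c (not yet touched, or already finished) is trustworthy.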

Lampson-Sturgis assumptions:

  1. Storage writes may fail, or may corrupt another piece of storage.
    a. Assume a later read can detect the bad sector.
  2. Storage can spontaneously decay.
    a. As in 1a, assume a later read can detect the bad sector.
  3. Errors are rare.
  4. Assume there is enough time to repair errors before too many accumulate.

Implementing atomicity above a lower-level system that has atomic page writes

How to build a file system robustly: suppose an update touches 10 blocks. Keep a second ("copy") file system alongside the actual one:

  1. Write the 10 new blocks to the copy
  2. Copy the 10 new blocks from the copy to the actual file system

On reboot, we must be able to decide which version counts as the actual file system. Ideas on how to do this robustly:

Idea #1: Commit Record

  1. Write the new blocks to the copy FS
  2. Wait for the blocks to hit the copy disk
  3. Write a commit record saying "done"
  4. Copy the new blocks from the copy FS to the actual FS
  5. Wait for the blocks to hit the actual disk
  6. Clear the commit record

Looks like 1 big atomic operation!

BEGIN
pre-commit phase (can back out without effect)
COMMIT (the one atomic, low-level action)
post-commit phase (affects performance, not correctness; cannot be backed out)
END
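The commit-record protocol and its recovery procedure can be sketched as a toy simulation (a dict stands in for stable storage; the key names are made up for illustration):

```python
# Toy commit-record protocol over two whole-file-system copies.

def write_with_commit(disk, new_blocks):
    disk["copy"] = dict(new_blocks)      # steps 1-2: write copy, wait
    disk["commit"] = True                # step 3: atomic commit record
    disk["actual"] = dict(disk["copy"])  # steps 4-5: install into actual FS
    disk["commit"] = False               # step 6: clear commit record

def recover(disk):
    """Run on reboot: finish or ignore the interrupted operation."""
    if disk.get("commit"):               # committed: redo the install
        disk["actual"] = dict(disk["copy"])
        disk["commit"] = False
    # not committed: still pre-commit; old 'actual' is intact, do nothing

disk = {"actual": {"x": 1}, "copy": {}, "commit": False}
# simulate a crash after step 3: copy and commit record hit disk,
# but the new blocks were never installed in the actual FS
disk["copy"] = {"x": 2}
disk["commit"] = True
recover(disk)
assert disk["actual"] == {"x": 2} and disk["commit"] is False
```

The commit record is the single atomic action: before it, recovery discards the copy; after it, recovery finishes the install. Either way the operation looks all-or-nothing.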

Idea #2: Journaling

Like double-entry bookkeeping: the log is append-only (never erased), so there are always 2 copies of everything (one in the log, one in cell storage).

Basic Protocol

Log each update before installing it in cell storage:

  1. Write the changed blocks to the log
  2. Write a COMMIT record
  3. Install the changed blocks into cell storage
  4. Clear the commit record

Run a recovery procedure after crashes to replay the journal:

  1. Scan the log in forward order
  2. For every COMMIT record found, redo steps 3 and 4 of the protocol (install the logged blocks, then clear the commit record); changes logged but never committed are ignored

(In practice, running out of journal space doesn't happen very often!)
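The basic protocol and its replay can be sketched as a toy write-ahead journal (illustrative names and layout, not real FS code):

```python
# Toy write-ahead journal: changed blocks are appended to the log,
# then a COMMIT record; only afterwards are they installed into
# cell storage.

def commit(log, changes):
    for block, value in changes.items():
        log.append(("write", block, value))   # step 1: log changed blocks
    log.append(("COMMIT",))                   # step 2: commit record

def recover(log, cells):
    """Replay: scan the log forward; reinstall every committed change.
    Idempotent, so it is safe to crash during recovery and rerun it."""
    pending = {}
    for record in log:
        if record[0] == "write":
            pending[record[1]] = record[2]
        else:                                 # COMMIT record
            cells.update(pending)             # redo steps 3-4 (install)
            pending = {}
    # an uncommitted tail of the log is simply ignored

cells, log = {"b1": "old"}, []
commit(log, {"b1": "new"})     # crash before installation...
recover(log, cells)            # ...recovery replays the committed change
assert cells["b1"] == "new"
```

Because replay only copies already-logged values, running it twice gives the same result as running it once, which is exactly the idempotence the next section requires.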

Recovery

Recovery must work even if the system crashes during recovery itself, so the recovery procedure must be idempotent. (With a log, cell storage can even live in RAM, since the log alone makes the data durable.)
There are two types of logging and recovery:

  1. Write-ahead log: log the new values before installing them; recover by scanning the log forward and redoing committed changes.
  2. Write-behind log: log the old values before overwriting them; recover by scanning the log backward and undoing uncommitted changes.
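The write-behind (undo) variant can be sketched the same way as the write-ahead one (a toy model with made-up block names, not real recovery code):

```python
# Toy write-behind (undo) log: old values are saved to the log before
# cell storage is overwritten; recovery scans BACKWARD and undoes any
# changes made after the last COMMIT record.

def change(log, cells, block, value):
    log.append(("old", block, cells[block]))  # save the old value first
    cells[block] = value                      # then overwrite cell storage

def recover(log, cells):
    for record in reversed(log):              # scan backward
        if record[0] == "COMMIT":
            break                             # everything earlier committed
        _, block, old_value = record
        cells[block] = old_value              # undo the uncommitted change

cells, log = {"b1": "v0", "b2": "v0"}, []
change(log, cells, "b1", "v1")
log.append(("COMMIT",))
change(log, cells, "b2", "v1")   # crash before this change commits
recover(log, cells)
assert cells == {"b1": "v1", "b2": "v0"}
```

The trade-off mirrors the two protocols: write-ahead delays installs until after commit, write-behind installs eagerly but must be able to roll back.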

Performance problem

save() takes a long time because it must wait for the data to hit disk. As a solution, many systems optimize by returning before the data is actually durable.

One workaround: delay "external" actions (such as output to the screen) until the save() has actually committed, so the user never observes a result that could still be lost.
Another consequence: a later action might unexpectedly fail after earlier ones appeared to succeed; then we need cascading aborts and compensating actions to handle the disk writes already issued.

Look forward to more on disk failures in the next lecture!