Scribe notes by
Yajie Wang
Dean Ushijima
It is important to note that errors don't always lead to faults and failures (especially if they're consistent)
Consider the following example: rename("a", "b")
Steps:
1. Read directory page into RAM
1b. If "b" exists, go to step 98
2. Change "a" to "b" in RAM
3. Write out modified page
Where can crashes occur? At steps 1 and 2, crashes are okay. But if a crash occurred at step 3, bad things can happen. Of course, if page writes are atomic, then a crash would not result in disaster. Other bad things can happen when a separate process is allowed to do steps 2 and 3 (in the case of optimizing aggressively so that what is on disk is not the same from the application's point of view).
Now consider this example: rename("d/a", "e/b")
Steps:
1. Read d's data page into RAM
2. Read e's data page into RAM
2a. Make sure b does not exist, otherwise go to step 99
3. Alter d's page in RAM to clear entry for a / add b to RAM
4. Write d's page
5. Add b to e's page
6. Write e's page
... and eventually increment last d and e's mod time
It would be very bad if a power failure occurred somewhere in steps 4-6. If a crash occurred between steps 4 and 5, the file would be lost. To fix this problem, we should probably switch #3,4 with #5,6. However, a crash now could lead to two duplicate files with a link count of 1. Thus, we need to increment link count so that the crash would accurately create two duplicate files with link count of 2. Then after a reboot, we can do an unlink and fix the problem.
We should make the following revisions to the algorithm:
Steps:
1. Read d's data page into RAM
2. Read e's data page into RAM
2a. Make sure b does not exist, otherwise go to step 99
3. Add b to e's page
3a. Increment file's link count on disk
4. Write e's page
5. Alter d's page in RAM to clear entry for a / add b to RAM
6. Write d's page
6a. Decrement link count
... and eventually increment last d and e's mod time
If this is true before, it would be true afterwards as well
Consequence File System Correctness Invariants
Disaster!
1. Every block is used for exactly one purpose (boot/superblock/bitmap/node/data)
Disaster!
2. All blocks reachable from the superblock are initialized to valid values.
Disaster!
3. All reference (reachable) blocks of data are marked 'used' in the bitmap.
Disk Leak
4. All unreachable blocks are marked 'unused'.
Disk Leak can be fixed by fsck() after reboot.
Suppose we want to do alot of writes. Usually the OS will optimize the writes but would this undo compromise the robustness of the system? Is some sort of reordering possible?
Recording write requests is common for performance
Fix?
On real disks, typically:
On other devices (in general):
Here are some possible methods to do writes: 2 pages to implement 1 page (Time is on vertical access doing down)
X |
? |
y |
with single copy of data
X | X |
? | X |
Y | X |
Y | ? |
Y | Y |
with two copies of data (can crash and still mess up after step 2)
a | b | c |
X | X | X |
? | X | X |
Y | X | X |
Y | ? | X |
Y | Y | X |
Y | Y | ? |
Y | Y | Y |
Fix: have 3 copies. When reading, read all 3. If they are all equal, you're done. If values disagree, overwrite the minority with a == b ? a : c
Implement atomicity above a lower level system with atomic page writes
On reboot, copy File System as actual File System. Ideas on how to build file systems robustly:
Looks like 1 big atomic operation!
BEGIN
pre-commit phase (can back out without effect)
COMMIT (atomic)
post-commit phase (affects performance, not behavior)
END
double entry bookkeeping (never erased) 2 copies of everything
Basic Protocol
Log Update before installing it in cell memory
Run a recovery procedure after crashes to replay journal
(In practice, running out of journal space doesn't happen very much!)
Recovery
It must work even if crash is in recovery. The recovery must be "idempotent"
and cell storage can be in RAM
There are two types of recovery:
Save() takes a long time for data to hit disk. As a solution, many systems optimize-->
One workaround: delay "external" actions/requests (such as graphics) until save() is actually done
another way: a later action might unexpectedly fail. Then we would have cascading
aborts and compensating actions to handle the disk writes.
Look forward to more on disk failures in the next lecture!