Lecture 13: File System Robustness
Notes prepared by: Bryan Chaidez and Magdaleno Nunez
File System Goals
Performance - High throughput (ability to handle lots of requests per second) and low latency
Atomicity - When you make a change to the file system, changes are either done or not done, never partially done.
Durability - Data survives failures of the underlying system (within limits; we only defend against certain failures)
What could go wrong with durability?
- Keyboard could die
- CPU failure, disk could die
- Battery dies
- RAM could unexpectedly flip a random bit
What failures are we worried about?
We want limited failure, within reason
Errors that are most likely and/or most damaging
Models of Underlying Systems
We need a model for failures
- Affects our design
- Should correspond to the real world
- Want simplicity
An example with a simple model
- No battery
- Can only lose power
Suppose during write_sector, we lose power
- If you're lucky, the power supply might have enough residual energy to complete the command
- If not, there will not be power to complete the command, resulting in a bad sector
To model this, we can assume an interrupted write leaves the sector holding the old contents, the new contents, or garbage
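One way to make that model concrete (a sketch; the function and names here are ours, not from the notes):

```python
import random

GARBAGE = object()  # sentinel standing in for a corrupted sector

def write_sector(disk, i, value, power_fails=False):
    """Model of a non-atomic sector write: if power fails mid-write,
    the sector ends up holding the old value, the new value, or garbage."""
    if power_fails:
        disk[i] = random.choice([disk[i], value, GARBAGE])
    else:
        disk[i] = value

disk = ["X"]
write_sector(disk, 0, "Y")                    # normal write succeeds
assert disk[0] == "Y"
write_sector(disk, 0, "Z", power_fails=True)  # power loss mid-write
assert disk[0] in ("Y", "Z", GARBAGE)         # any of the three outcomes
```

Everything that follows is about designing around this three-outcome model.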
Golden Rule: Never overwrite your only copy
Keep two copies of the data (F1 and F2) plus a marker saying which copy is current.
Using 4 writes to change the value to Y:
- Write F2 first (F2 = Y), then set the write marker to 2
- Write F1 next (F1 = Y), then set the write marker back to 1
Using 3 writes:
- The fourth write is unnecessary: once the marker points to 2, readers already see the new value, so after writing F1 = Y we can leave the marker alone
- However, we still need 3 writes to do 1 write
- Is there a more efficient way?
Using timestamps
- Write F2 (F2 = Y) with a timestamp appended; reads use the copy with the newest timestamp
- Now we only need 1 write, but a torn write can leave garbage that still carries a valid-looking timestamp
Classic Solution
- Strive to have all 3 copies hold the same value
- To read after crash:
    if (F1 == F2)
        return F1  // or F2
    else
        return F3
- This still needs 3 copies; we would like fewer
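A sketch of why the three-copy scheme works, assuming writes happen in the order F1, F2, F3 and a crash garbles only the sector being written (that ordering is our assumption; the notes don't spell it out):

```python
def write_value(disk, v):
    # Write the copies one at a time; at any instant at most one copy
    # is mid-write, so the other two let us recover.
    for i in range(3):
        disk[i] = v

def read_value(disk):
    F1, F2, F3 = disk
    if F1 == F2:
        return F1  # or F2
    else:
        return F3

# Simulate a crash during each of the three writes (the sector being
# written becomes garbage) and check that reads stay consistent:
for crash_at in range(3):
    disk = ["X", "X", "X"]     # old value everywhere
    for i in range(crash_at):
        disk[i] = "Y"          # writes that completed before the crash
    disk[crash_at] = "?"       # the torn write -> garbage
    assert read_value(disk) in ("X", "Y")  # never garbage
```

A crash during the F1 or F2 write yields the old value (as if the write never happened); a crash during the F3 write yields the new value. Either way the result is all-or-nothing.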
Lampson-Sturgis Assumptions
- Storage writes may fail, corrupting the written sector or a sector nearby
- Storage can decay spontaneously
- A later read can detect the bad sectors
- Errors are relatively rare
- Repairs can be done before too many errors accumulate
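Because errors are detectable and rare, these assumptions suggest a "careful read" that simply retries. A sketch (the toy disk interface is ours, invented for illustration):

```python
class DecayingDisk:
    """Toy disk where a read may fail, but failures are detectable."""
    def __init__(self, data, bad_reads=0):
        self.data = data
        self.bad_reads = bad_reads   # number of reads that will fail first

    def read(self, n):
        if self.bad_reads > 0:
            self.bad_reads -= 1
            return None, False       # detectably bad read
        return self.data[n], True

def careful_read(disk, n, tries=3):
    # Errors are rare and detectable, so a few retries almost always succeed.
    for _ in range(tries):
        data, ok = disk.read(n)
        if ok:
            return data
    raise IOError("sector %d unreadable after %d tries" % (n, tries))

d = DecayingDisk({7: "hello"}, bad_reads=2)
assert careful_read(d, 7) == "hello"   # succeeds on the third attempt
```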
e.g. rename("a","b")
1. read inode 357 into RAM
2. read data block 952736 into RAM
3. change "a" to "b" in RAM
4. write data block out to disk
5. update inode timestamp in RAM
6. write inode to disk
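The six steps, written out as code (a sketch; the Disk class is a stand-in for real sector I/O, while the inode and block numbers come from the notes):

```python
import time

class Disk:
    """Toy sector store standing in for real disk I/O."""
    def __init__(self, sectors):
        self.sectors = dict(sectors)
    def read(self, n):
        return self.sectors[n]
    def write(self, n, data):
        self.sectors[n] = data

def rename_a_to_b(disk):
    inode = dict(disk.read(357))        # 1. read inode 357 into RAM
    block = disk.read(952736)           # 2. read data block 952736 into RAM
    block = block.replace("a", "b", 1)  # 3. change "a" to "b" in RAM
    disk.write(952736, block)           # 4. write data block out to disk
    inode["mtime"] = time.time()        # 5. update inode timestamp (RAM only)
    disk.write(357, inode)              # 6. write inode to disk
    # A crash between steps 4 and 6 leaves the new name on disk
    # while the inode on disk is still stale.

disk = Disk({357: {"mtime": 0}, 952736: "a"})
rename_a_to_b(disk)
assert disk.read(952736) == "b"
```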
What can go wrong?
Crashes after steps 4 and 5 may cause problems: the data block with the new name is on disk, but the inode write (step 6) has not happened, so the on-disk metadata is stale
CS111 Assumptions
Single data block writes are atomic (all or nothing)
e.g. rename("a","bb")
The new name is longer, so it may not fit in the old directory entry; instead, copy the entry into a new, bigger entry within the data block, then invalidate the old one (only works if the block is not full)
e.g. rename("d/a","e/a")
Read inodes 357 and 362 and data blocks 952736 and 1012 into RAM, and assume there is enough room in 1012 for a new entry. Then:
1. write 1012 (add the entry for a in e)
2. write 952736 (remove the entry for a in d)
3. write 362
4. write 357
We do 1 first so we don't lose track of the file. But if we crash between 1 and 2, two directory entries point to inode 921 (a's inode) while its link count is still 1; if either entry is then removed, 921 is freed and the other entry becomes a dangling pointer.
To fix this, after step 1, set 921's link count to 2, and after step 2, change the count back to 1.
The problem here is that a crash between the directory writes and the count updates still leaves the count wrong: too high (a storage leak) if we crash before resetting it, and, worse, still 1 with two entries if we crash right after step 1, before the count is bumped.
To fix this, change the order so the link count is raised before any directory entry is written:
1. change 921's link count to 2
2. write 1012
3. write 952736
4. change 921's link count back to 1
5. write 362
6. write 357
Now the worst a crash can do (between 1 and 2, or between 3 and 4) is leave the count too high: a storage leak, never a dangling pointer.
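A quick way to convince yourself the ordering works: simulate a crash after each step and check that inode 921 is never unreachable and the link count is never too low. A sketch (names are ours; the two timestamp writes to 362 and 357 don't affect reachability, so they are omitted):

```python
def crash_state(steps_done):
    """File-system state if power fails after `steps_done` of the 4 writes."""
    state = {"links": 1, "entry_in_e": False, "entry_in_d": True}
    ops = [
        lambda s: s.update(links=2),           # 1. change 921 link count to 2
        lambda s: s.update(entry_in_e=True),   # 2. write 1012   (add e/a)
        lambda s: s.update(entry_in_d=False),  # 3. write 952736 (remove d/a)
        lambda s: s.update(links=1),           # 4. change 921 link count to 1
    ]
    for op in ops[:steps_done]:
        op(state)
    return state

for n in range(5):
    s = crash_state(n)
    entries = s["entry_in_e"] + s["entry_in_d"]
    assert entries >= 1           # 921 always reachable: never a lost file
    assert s["links"] >= entries  # count never too low: never a dangling pointer
# Crashes after step 1 or step 3 leave links == 2 with one entry:
# a storage leak, which is the mild failure we accepted.
```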
Invariants
These are conditions that should always be true for your system
e.g. for BSD-style file system
- Every block in the file system is used for exactly one purpose
- All referenced blocks are properly initialized with data appropriate for their type
- All referenced data blocks are marked as used in the bitmap
- All unreferenced data blocks are marked as free in the bitmap
Consequences of violating the corresponding invariant:
1. Disaster
2. Disaster
3. Disaster
4. Storage leak
If we had to risk violating an invariant, we should pick 4, since a storage leak is the least serious consequence: leaked blocks waste space but can be found and freed later.
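A sketch of how a checker (in the spirit of fsck) could test invariants 3 and 4 against the bitmap; the data layout here is invented for illustration:

```python
def check_bitmap(referenced, bitmap):
    """Return (leaked, disastrous) block lists.

    leaked:     marked used but referenced by nothing (violates invariant 4)
    disastrous: referenced but marked free -- the allocator could hand the
                block out again for a second purpose (violates invariant 3)
    """
    leaked = sorted(b for b, used in bitmap.items()
                    if used and b not in referenced)
    disastrous = sorted(b for b in referenced if not bitmap.get(b, False))
    return leaked, disastrous

bitmap = {1: True, 2: True, 3: False, 4: True}
referenced = {1, 2, 3}
leaked, disastrous = check_bitmap(referenced, bitmap)
assert leaked == [4]      # block 4: leak -- wasted space, fixable later
assert disastrous == [3]  # block 3: disaster -- free list may reuse it
```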
Building Robust File Systems
Commit Record
Keep a separate record used to mark whether an atomic action is done or not done
Code for doing commits:
BEGIN
    pre-commit phase (invisible to upper level)
COMMIT or ABORT
    post-commit phase (invisible to upper level)
END
From the user's point of view, the whole commit process is atomic
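A sketch of the commit-record idea, relying on the CS111 assumption that a single-sector write (here, the commit record itself) is atomic. The sector names and crash-simulation interface are ours:

```python
def atomic_update(disk, new_value, crash_after=None):
    """Update disk['cell'] atomically via a commit record.
    `crash_after` simulates losing power after that many sector writes."""
    writes = [
        ("scratch", new_value),   # pre-commit: invisible to readers
        ("commit", "COMMITTED"),  # the one atomic write that commits
        ("cell", new_value),      # post-commit: install the value
        ("commit", "DONE"),       # mark the whole action finished
    ]
    for i, (sector, value) in enumerate(writes):
        if crash_after is not None and i >= crash_after:
            return                # power lost here
        disk[sector] = value

def read_after_crash(disk):
    # If committed, the scratch copy is the truth; otherwise ignore it.
    if disk.get("commit") in ("COMMITTED", "DONE"):
        return disk["scratch"]
    return disk["cell"]

# Wherever we crash, a reader sees old or new -- never something partial:
for crash in range(5):
    disk = {"cell": "old", "commit": "EMPTY", "scratch": None}
    atomic_update(disk, "new", crash_after=crash)
    assert read_after_crash(disk) in ("old", "new")
```

The entire outcome hinges on the single commit write: before it, the action looks like it never started; after it, recovery can always finish the job.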
Journaling
- - Wastes storage: the log only grows, and will eventually run off the end of the disk
- + Solves many inconsistency problems
- + If mostly writing, avoids seeks
- log the change into the journal first
- append a commit record to the journal entry
- later, copy the change into cell storage and mark the entry as DONE
Recovery must be idempotent (because recovery can crash too!)
Recovery must handle aborts as well
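A sketch of why idempotence comes for free here: replaying a journal entry is a pure overwrite, so if recovery crashes partway through, running it again from the start changes nothing. The journal format below is invented for illustration:

```python
def recover(journal, cell):
    """Copy every committed journal entry into cell storage.
    Aborted (and uncommitted) entries are skipped."""
    for entry in journal:
        if entry["status"] == "COMMITTED":
            cell[entry["addr"]] = entry["new"]  # pure overwrite: idempotent
    return cell

journal = [
    {"addr": 5, "new": "v1", "status": "COMMITTED"},
    {"addr": 9, "new": "v2", "status": "ABORTED"},    # must not be applied
    {"addr": 5, "new": "v3", "status": "COMMITTED"},  # later entry wins
]
once = recover(journal, {5: "old", 9: "old"})
twice = recover(journal, dict(once))
assert once == {5: "v3", 9: "old"}
assert twice == once   # running recovery again changes nothing
```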
Write-ahead log:
- log what you do
- commit
- copy in cell data
Write-behind log:
- log old cell contents
- write new cell contents
- commit
Both methods work, but with different performance characteristics
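The difference also shows up in the recovery direction: a write-ahead log redoes committed actions (scan forward, install new values), while a write-behind log undoes uncommitted ones (scan backward, restore old values). A sketch under an invented record format:

```python
def recover_write_ahead(log, cell):
    # Redo: install the new values of committed actions, oldest first.
    for rec in log:
        if rec["committed"]:
            cell[rec["addr"]] = rec["new"]

def recover_write_behind(log, cell):
    # Undo: restore the old values of uncommitted actions, newest first.
    for rec in reversed(log):
        if not rec["committed"]:
            cell[rec["addr"]] = rec["old"]

log = [
    {"addr": 1, "old": "a0", "new": "a1", "committed": True},
    {"addr": 2, "old": "b0", "new": "b1", "committed": False},  # crashed mid-action
]
# Write-ahead: cell may lack the committed data; redo installs it.
cell = {1: "a0", 2: "b0"}
recover_write_ahead(log, cell)
assert cell == {1: "a1", 2: "b0"}
# Write-behind: cell may contain uncommitted data; undo rolls it back.
cell = {1: "a1", 2: "b1"}
recover_write_behind(log, cell)
assert cell == {1: "a1", 2: "b0"}
```

Either way, after recovery the cell storage reflects exactly the committed actions and none of the aborted ones.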