CS 111 Fall 2013
Lecture 13: File System Robustness
By Justin Bass and Synthia Ling
File System Goals
Robustness: the system still works even when given unusual inputs or run under unusual conditions
Can decompose into subgoals:
- Durability- survives limited, expected hardware failures (e.g. losing power, running out of disk space) but not arbitrary failures like a computer exploding; data already saved should survive a power loss
- Atomicity- changes are either done or not done, never partially done (helps during power loss)
- Performance- throughput and latency maintained under heavy loads
What can go wrong with hardware?
- Lose power
- lose contents of RAM, cache (CPU, controller), registers
- data on disk is retained
- Run out of disk space
- Heavily loaded system
- Sudden disconnection (of a hard drive)
- Bad sectors/tracks (a partial failure)
- Lose disk (disk crash)
- If there are 100 disks and 1 fails, we don't need to lose any data or even crash
Solutions:
- Backup power gives enough time to save RAM/cache- assume we're not doing this
- Suppose you're writing to disk - you still get a few microseconds after power is removed
- CPU executes a few instructions
- disk controller moves arms to a 'parking area' so data won't be incorrectly written somewhere else
- We'd like to assume sector writes are atomic, BUT they're not always
In Understanding the Robustness of SSDs under Power Fault by Zheng et al. at FAST '13
- Out of 15 SSDs, only 2 worked when they cut the power
- The 13 that failed to work suffered from
- bit corruption- bad bits in same or adjacent sectors
- flying writes- sectors got written perfectly but at wrong location, possibly due to wear-levelling that did not have time to store where the data was written
- shorn writes (partially written sectors)
- metadata corruption (where disk info is stored; this is serious because it affects the entire disk)
- dead device (no read/writes at all upon reboot, metadata too scrambled: all data is lost, disk requires reformatting)
- unserializability
- e.g. writes A,B,C,D,E are requested sequentially
- but only A, C, and E get written, not B or D
Summary: even though SSDs can complete more operations after power is cut, they exhibit more failure modes, because their high performance relies on parallel writes, wear-levelling, etc.
Loss of Power Model
Suppose we're trying to implement ^X^S in Emacs
write(fd, buf, sectorsize)
Assume no OS for now (no buffering, no dallying, etc.). If we lose power, we can lose data.
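As a minimal sketch (the save path and the 8192-byte block size are assumptions for illustration), the naive ^X^S save is a single in-place write; a power failure in the middle can tear the sector, and the old contents are already gone:

    #include <fcntl.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192   /* assumed block size for this sketch */

    /* Naive save: overwrite the only copy in place. If power fails
       mid-write, the old data is gone and the new data may be torn. */
    int naive_save(const char *path, const char *buf)
    {
        int fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;
        ssize_t n = write(fd, buf, BLOCK_SIZE);  /* not atomic on a real disk */
        close(fd);
        return n == BLOCK_SIZE ? 0 : -1;
    }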
Assuming we have two devices, we must follow the...
GOLDEN RULE OF ATOMICITY
Never overwrite your only copy
Attempt 1: Have TWO copies of data
Keep two copies of the data, A and B; do write(A), then write(B)
In the timelines below: 0 = old version, 1 = new version, ? = write in progress
     1 | 2 | 3 | 4 | 5
A |  0 | ? | 1 | 1 | 1
B |  0 | 0 | 0 | ? | 1
Method: if they disagree, choose A
Problems: data is bloated by 2x, and it STILL does not work: in column 2, A is mid-write, so choosing A yields a garbage value
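A minimal sketch of this recovery rule in C, with read_copy() as a hypothetical device read; note that the rule returns A either way, which is exactly what fails in column 2:

    #include <string.h>

    #define BLOCK_SIZE 8192

    extern void read_copy(int which, char *buf);  /* hypothetical device read */

    /* Two-copy rule: if A and B agree, either copy works; if they
       disagree, choose A. Returns 1 if the copies agreed.
       In timeline column 2, A itself is a torn sector, so the
       fallback hands back garbage. */
    int recover_two(char *out)
    {
        char a[BLOCK_SIZE], b[BLOCK_SIZE];
        read_copy(0, a);
        read_copy(1, b);
        int agree = memcmp(a, b, BLOCK_SIZE) == 0;
        memcpy(out, a, BLOCK_SIZE);  /* rule picks A either way */
        return agree;
    }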
Attempt 2: Have THREE copies of data
     1 | 2 | 3 | 4 | 5 | 6 | 7
A |  0 | ? | 1 | 1 | 1 | 1 | 1
B |  0 | 0 | 0 | ? | 1 | 1 | 1
C |  0 | 0 | 0 | 0 | 0 | ? | 1
Method: choose the value that two of the three copies agree on; if all three disagree (column 4), choose A, which is written first
Solves
- bit corruption
- shorn writes
- flying writes
- dead device
- unserializability
- metadata corruption
Problem: this solution is terrible for real-life purposes (3x the storage and 3x the writes) and is rarely implemented.
It is mainly useful for teaching us how to model and formulate failure problems
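Still, as a modeling exercise, here is a sketch of the majority-vote recovery (read_copy() is again hypothetical). Since at most one copy is mid-write at any moment, if no two copies agree the torn one must be B (A finished, C not yet started), so A is the correct choice:

    #include <string.h>

    #define BLOCK_SIZE 8192

    extern void read_copy(int which, char *buf);  /* hypothetical device read */

    /* Three-copy recovery: take the majority; if all three disagree,
       choose A, which is written first. */
    void recover_three(char *out)
    {
        char a[BLOCK_SIZE], b[BLOCK_SIZE], c[BLOCK_SIZE];
        read_copy(0, a);
        read_copy(1, b);
        read_copy(2, c);
        if (memcmp(a, b, BLOCK_SIZE) == 0)
            memcpy(out, a, BLOCK_SIZE);   /* A == B: majority */
        else if (memcmp(b, c, BLOCK_SIZE) == 0)
            memcpy(out, b, BLOCK_SIZE);   /* B == C: majority (A torn) */
        else
            memcpy(out, a, BLOCK_SIZE);   /* all differ: B torn, A is newest */
    }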
Idealized model: writes are not atomic, but they eventually finish
Lampson-Sturgis Assumptions
- Storage writes may fail, or corrupt another piece of storage
- A later read can detect the bad sector
- Storage may spontaneously decay (e.g. bits may flip across bus)
- Errors are rare
- Repairs can be done in time (before another failure occurs)
Eggert's Assumptions (overly simple and optimistic)
- block writes are atomic
assert(lseek(fd, 0, SEEK_CUR) % 8192 == 0); //current offset is block-aligned
write(fd, buf, 8192) //assume block is allocated
Assume a BSD-style file system: superblock->bitmap->inodes->data
Thus the data is stored in the data blocks, while the timestamp is stored in the inode
This can cause problems for time-sensitive programs (e.g. make, source control) if a power failure occurs after only one of the two has been written
Now we run: $ ls -l file
Option 1: If we write a timestamp before we write data, the data could be wrong (a null change)
Option 2: If we write data before we write a timestamp, the timestamp could be wrong (a hidden change)
Option 1 is preferable: better to err on the side of recording actions that never happened than to have real actions recorded nowhere. The drawback is that moving the disk arm between the timestamp (in the inode) and the data blocks takes a lot of time
Common optimization: do the data writes now, dally the timestamp writes
Problem: Breaks 'make' or other programs that depend on timestamp accuracy
Solution: Cleanup procedure after reboot
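A sketch of Option 1's ordering, with hypothetical helpers standing in for the inode and data-block writes; a crash between the two writes produces only a null change:

    #include <time.h>

    extern void write_inode_timestamp(int ino, time_t t);       /* hypothetical */
    extern void write_data_block(int blockno, const char *buf); /* hypothetical */

    /* Option 1: timestamp first, then data. A crash in between leaves
       a null change (timestamp bumped, data unchanged): make may
       rebuild needlessly, but no real change goes unrecorded. */
    void save_block(int ino, int blockno, const char *buf)
    {
        write_inode_timestamp(ino, time(NULL));  /* record intent first */
        write_data_block(blockno, buf);          /* then the actual change */
    }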
rename ("d/a", "d/b")
find d's inode
find d's data
rename ("d/a", "d/bb")
harder: the longer name may not fit in the existing directory entry, so a new directory block may need to be allocated
rename ("d/a", "e/b")
If we write e's data (the new entry) before updating the link count and power is lost, two directory entries will point to the inode but its link count will still be 1
Then if one of the entries is unlinked (linkcount--), the inode will be freed even though the other entry still references its data.
Thus write/handle linkcount in this order:
increment link count
write e's data
write d's data
decrement link count
The link count never underestimates the true number of links, but it can overestimate; an overestimate merely leaks storage, while an underestimate would free live data.
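A sketch of this ordering for rename("d/a", "e/b"), using hypothetical helpers for the low-level disk writes; at every crash point the stored link count is at least the number of real directory entries:

    extern int  read_link_count(int ino);          /* hypothetical accessors */
    extern void write_link_count(int ino, int count);
    extern void write_dir_data(const char *dir);   /* flush a directory's block */

    /* Cross-directory rename, ordered so a crash can only overestimate
       the link count (a storage leak), never underestimate it (which
       would let a live inode be freed). */
    void rename_d_a_to_e_b(int ino)
    {
        write_link_count(ino, read_link_count(ino) + 1);  /* 1. bump count to 2 */
        write_dir_data("e");                              /* 2. add entry e/b   */
        write_dir_data("d");                              /* 3. remove entry d/a */
        write_link_count(ino, read_link_count(ino) - 1);  /* 4. back to 1       */
    }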
Models of Underlying Systems
Invariants for our system
1. Every block of the file system is used for exactly one purpose
2. All referenced blocks are initialized to data appropriate for their type
3. All referenced blocks are marked used in the bitmap
4. All unreferenced blocks are marked free in the bitmap
Consequences of violating each invariant:
1. Disaster: if a block has two purposes, you could think a block is free when it is not -> data loss
2. Disaster
3. Disaster
4. Storage leak
Number 4 has the mildest consequence: it only wastes space, so it may be temporarily violated during certain system calls and repaired later; it is not really a hard invariant
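Because a violation of invariant 4 is only a leak, a post-reboot checker in the spirit of fsck can reclaim such blocks lazily; a sketch with hypothetical accessors:

    #include <stdbool.h>

    #define NBLOCKS 8192

    extern bool bitmap_used(int blockno);             /* hypothetical accessors */
    extern bool referenced_by_some_inode(int blockno);
    extern void mark_free(int blockno);

    /* Invariant 3 violations (referenced but marked free) need real
       repair; invariant 4 violations (unreferenced but marked used)
       can simply be reclaimed. */
    void reclaim_leaks(void)
    {
        for (int b = 0; b < NBLOCKS; b++)
            if (bitmap_used(b) && !referenced_by_some_inode(b))
                mark_free(b);   /* leaked block: safe to reclaim */
    }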
Performance Improvement Ideas
- Commit Record
- Write "data" outputs into a preliminary area, e.g. a temp file
- Wait for all blocks to hit the disk (reordering is OK)
- Write a commit record saying done
- Copy temporary data back to original
- Write "done" to a commit record
General pattern (appears atomic at the higher level):
- BEGIN
- precommit phase: work is being done, but you can still back out
- COMMIT (atomic at the lower level)
- postcommit phase: more work, can't back out; usually done for performance reasons
- END
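A familiar user-level instance of this pattern is the write-to-temp-then-rename save (the ".tmp" naming here is just illustrative): writing the temp file is precommit work we can back out of, rename() is the atomic COMMIT, and reclaiming the old file's blocks is postcommit work:

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    /* Save buf atomically: build the new version in a temp file,
       then rename() it over the original in one atomic step. */
    int atomic_save(const char *path, const char *buf, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.tmp", path);   /* illustrative naming */
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);        /* back out: still in the precommit phase */
            return -1;
        }
        close(fd);
        return rename(tmp, path);   /* COMMIT: atomic at the FS level */
    }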
- Divide the disk into 2 parts: cell storage and a journal
The journal is a sequence of change records listed in temporal order
To make a change: grab unused journal space, write a description of the change, then apply the change to cell storage
Circularly linked: when journal reaches end, wrap to beginning (beginning entries should be obsolete by this point)
Optimized for writes
Reads are slow, but a commonly used subset could be cached in RAM
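A sketch of the append path for such a circular journal, using an in-memory array as a stand-in for the on-disk journal region (the entry layout is an assumption):

    #include <stdint.h>
    #include <string.h>

    #define JOURNAL_SLOTS 1024

    struct journal_entry {
        uint32_t blockno;      /* cell-storage block this change targets */
        uint32_t done;         /* set once the change has been applied */
        char     data[512];    /* description of the change */
    };

    static struct journal_entry journal[JOURNAL_SLOTS];  /* in-memory stand-in */
    static int head;  /* next free slot */

    /* Append a change description; the journal is circular, so wrapping
       overwrites the oldest entries, which should be obsolete by then. */
    void journal_append(uint32_t blockno, const char *desc)
    {
        struct journal_entry *e = &journal[head];
        head = (head + 1) % JOURNAL_SLOTS;  /* wrap to the beginning */
        e->blockno = blockno;
        e->done = 0;
        strncpy(e->data, desc, sizeof e->data - 1);
        e->data[sizeof e->data - 1] = '\0';
    }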
Logging protocols:
Write-ahead log:
- log the planned writes
- then implement them
Write-behind log:
- log the old data that is about to be overwritten
- write the new data into cell storage
- mark the log entry as done
When replaying logs after a crash:
- a write-ahead log is replayed left to right (oldest entry first), redoing the planned writes
- a write-behind log is replayed right to left (newest entry first), restoring the saved old data for changes not marked done
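A sketch of the two replay directions, with hypothetical helpers for reading, applying, and undoing log entries:

    #define LOG_LEN 1024

    extern int  log_count;          /* entries in the log, oldest first */
    extern void apply_entry(int i); /* redo: install the logged new data */
    extern void undo_entry(int i);  /* undo: restore the logged old data */
    extern int  entry_done(int i);  /* write-behind "done" mark */

    /* Write-ahead log: redo every planned write, oldest to newest
       (L to R), so later writes win as they would have on the live system. */
    void replay_write_ahead(void)
    {
        for (int i = 0; i < log_count; i++)
            apply_entry(i);
    }

    /* Write-behind log: walk newest to oldest (R to L), undoing any
       change that never got its "done" mark. */
    void replay_write_behind(void)
    {
        for (int i = log_count - 1; i >= 0; i--)
            if (!entry_done(i))
                undo_entry(i);
    }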