By: Victor Wong, Nhi Nguyen
11/18 CS 111 Lecture
File System Robustness
- Subgoals:
-
- Durability
- survives limited hardware failures
Ex) full disc, lose power, etc.
- Atomicity
- ability to either make a change or not
- Performance
- perform well with unusual requests
- What can go wrong with hardware?
-
- lose power
- run out of disc space
- heavily loaded
- sudden disconnection
- bad sectors/tracks
- lose disc (disc crash)
Consider Simplest Case: Lose Power
Lose contents of:
RAM
cache(CPU, controller)
registers
Keep contents of:
Disc
Flash
One common approach: (assume backup power: battery)
suppose you write to disc for a few us after shut off
disc controller lifts dis head to area w/ nothing there
CPU executes a few instructions
Want:
if process in the middle of writing to a section, then the write will finish
either power turns off before & sector write doesn't get executed or power turns off during and write finishes.
Assume:
sector writes are atomic
Example
Zheng et al. FAST'13 "Understanding the Robustness of SSD under Power Fault"
tested 15 SSD under power fault: 2 succeeded.
Problems:
bit corruption: corrupted sectors
flying writes: sectors written to wrong location (due to load balancing)
shorn writes: writes parts of sectors but not the whole thing (due to wear leveling)
metadata corruption: invalid memory addresses
dead device:
unserializability: the writes were not done in order
*Golden Rule of Atomicity* - never overwrite your only copy
One solution:
keep two copies in two devices (A,B)
0 - old version
1- new version
A 0 ? 1 1 1
B 0 0 0 ? 1
t ---->
Example: both versions A and B don't agree - which one is correct?
if both disagree choose A
Problem:
bloated data by twice.
Solution:
add a byte that checks if write is success
-same problem of read and write though.
Another solution:
keep three copies of data
Solution:
three copies of data
A 0 ? 1 1 1 1 1
B 0 0 0 ? 1 1 1
C 0 0 0 0 0 ? 1
*take the majority, if all three disagree then chose A
solves bit corruption and shorn writes
idealized model = writes are not atomic but eventually finish and return success
- Lampson-Sturgis model
-
storage writes may fail (if they fail they might corrupt another piece of storage
a late read than detect the bad sector
storage may spontaneously decay. (electrical interference causes bit flips across BUS)
Errors are rare
Repairs can be done in time before another failure occurs (if you have disc failure someone can change/fix the disc before something else fails.)
- Eggert assumptions
- block writes are atomic
lseek(fd,0,SEEK_SET)%8192==0)
write(fd,buf,8192) <- assume already allocated
assume BSD file systems
- $ls -l file (time updated)
- Nov 18 12:53 <- stored in inode
- if the system crashes the timestamp is wrong. (version control)
- update time stamp to notify user that something could change for 'make'
- common optimization:
-
- do writes, dally time stamps
-
problem: breaks 'make' or operations that rely on timestamps
solution: clean up procedure after reboot
- rename("d/a", "d/b")
-
- find "d" -> find d's inode
-
find d's data
rename "a" to "b"
write to disc and update
rename("d/a", "e/b")
d: | |"a" | | |
e: | | | | |
"a" has inode #573
- a) erase "a" first
-
>if crash then data is lost
- b) write to e
-
- if crash then the link count is wrong b/c there are two files that have inode number 573 but the inode link count is only 1:
-
problems when deleting files.
there will be a directory entry that has invalid inode #.
- solution:
-
increment "a"'s link count (i.e. the link count of inode# 573)
write data to "b" of "e"
erase "a" from "d"
decrement "a"'s link count (i.e. the link count of inode# 573)
**this approach may have storage whose link count may not drop to 0 and never gets reclaimed
the link count can be an overestimate of the true value, but never an underestimate
- need to have model for invariantsof file system:
-
- 1. every block of the file system is used for exactly one purpose
- boot, superblock, bitmap, inode, data
- 2. all referenced blocks are initialized to data appropriate for their type
- only blocks that contains data being pointed to are in use
3. all referenced blocks are marked as used in bit map
4. all unreferenced blocks are marked free in bitmap
- consequences of violating invariant file system:
-
1. data loss; disaster(writing data to bitmap block will overwrite bitmap)
2. inode entries that look like they're in use; access invalid data -> disaster
3. someone else will allocate block; overwrite files; data loss
4. storage leak
- rename: 4 I/O operations,
- 4 seeks 40 ms
Performance improvement ideas
- 1. commit record:
-
a) write 'data' outputs into a preliminary area (write blocks of file into temp file) 100 blocks
b) wait for all blocks to write into disc (reordering is okay)
c) write to commit record (write "commit")
d) copy temp data back to original, wait to hit disc (for the disc head to be ready)
e) write "done" to commit area.
*use the commit record to control atomicity
- General Pattern:
-
- BEGIN(marker)
- pre-commit phase (write to temp file)
work is being done but can still back out (if system crashes now, nothing happens)
- COMMIT or ABORT
- post-commit phase
work is being done, can't back out.
- END
-
** user can decide up until COMMIT marker to either keep going or backout.
- 2. Journal
-
sequence of instruction listed in temporal order
divides the FileSystem into cellblock and entries in journal refer to the cell index.
| | | | | | | | | | | | | | |unused area|
t --->
each block describes a change to the file system
Ex)
entry:
cellblock 2793
newversion: 1 <- 'data' to be written to cellblock.
- 1 + 2.
- combination of commit records and Journal
implement commit record in a Journal:
| | | | | |BEGIN |WRITE |COMMIT|COPY |DONE| |unused area|
^write "begin" ^write "commit" ^write "done"
t --->
*gives good write performance but reads are slow
Solution: cache the cell blocks into RAM(commonly used blocks)
-
or 4.
-
keep cell blocks and Journal in memory and cache some in RAM
- Login protocol:
-
write head logs
plan writes
later on implement
- write behind logs
-
log old data about to be overwritten
write new data into cell
mark as done
- replaying logs
-
Left to right