By: Victor Wong, Nhi Nguyen

11/18 CS 111 Lecture
File System Robustness

Subgoals:

Durability: survives limited hardware failures
Ex) full disc, lose power, etc.
Atomicity: ability to either make a change or not
Performance: perform well with unusual requests

What can go wrong with hardware?: lose power

run out of disc space

heavily loaded

sudden disconnection

bad sectors/tracks

lose disc (disc crash)

Consider Simplest Case: Lose Power

Lose contents of:

RAM

cache(CPU, controller)

registers

Keep contents of:

Disc

Flash

One common approach: (assume backup power: battery)

suppose you write to disc for a few us after shut off

disc controller lifts dis head to area w/ nothing there

CPU executes a few instructions

Want:

if process in the middle of writing to a section, then the write will finish

either power turns off before & sector write doesn't get executed or power turns off during and write finishes.

Assume:

sector writes are atomic

Example

Zheng et al. FAST'13 "Understanding the Robustness of SSD under Power Fault"
tested 15 SSD under power fault: 2 succeeded.

Problems:

bit corruption: corrupted sectors

flying writes: sectors written to wrong location (due to load balancing)

shorn writes: writes parts of sectors but not the whole thing (due to wear leveling)

metadata corruption: invalid memory addresses

dead device:

unserializability: the writes were not done in order

*Golden Rule of Atomicity* - never overwrite your only copy
One solution: keep two copies in two devices (A,B)

0 - old version
1- new version
A	0	?	1	1 	1
B 	0 	0 	0	?	1
t ---->

Example: both versions A and B don't agree - which one is correct?

if both disagree choose A

Problem:

bloated data by twice.

Solution:

add a byte that checks if write is success
-same problem of read and write though.

Another solution:

keep three copies of data

Solution:

three copies of data

A	0	?	1	1	1	1	1
B	0	0	0	?	1	1	1
C	0	0	0	0	0	?	1

*take the majority, if all three disagree then chose A solves bit corruption and shorn writes

idealized modelLampson-Sturgis model
storage writes may fail (if they fail they might corrupt another piece of storage
a late read than detect the bad sector
storage may spontaneously decay. (electrical interference causes bit flips across BUS)
Errors are rare
Repairs can be done in time before another failure occurs (if you have disc failure someone can change/fix the disc before something else fails.)

Eggert assumptions

block writes are atomic

$ls -l file (time updated)

Nov 18 12:53 <- stored in inode

if the system crashes the timestamp is wrong. (version control)

update time stamp to notify user that something could change for 'make'

common optimization:

do writes, dally time stamps: problem: breaks 'make' or operations that rely on timestamps
solution: clean up procedure after reboot

rename("d/a", "d/b")

find "d" -> find d's inode: find d's data
rename "a" to "b"
write to disc and update

rename("d/a", "e/b")

d:	|	|"a"	|	|	|	
e:	|	|	|	|	|

"a" has inode #573

a) erase "a" first

>if crash then data is lost

b) write to e

if crash then the link count is wrong b/c there are two files that have inode number 573 but the inode link count is only 1:: problems when deleting files.
there will be a directory entry that has invalid inode #.

solution:

increment "a"'s link count (i.e. the link count of inode# 573)
write data to "b" of "e"
erase "a" from "d"
decrement "a"'s link count (i.e. the link count of inode# 573)

**this approach may have storage whose link count may not drop to 0 and never gets reclaimed
the link count can be an overestimate of the true value, but never an underestimate

need to have model for invariantsof file system:

1. every block of the file system is used for exactly one purpose: boot, superblock, bitmap, inode, data
2. all referenced blocks are initialized to data appropriate for their type: only blocks that contains data being pointed to are in use

3. all referenced blocks are marked as used in bit map
4. all unreferenced blocks are marked free in bitmap

consequences of violating invariant file system:

1. data loss; disaster(writing data to bitmap block will overwrite bitmap)
2. inode entries that look like they're in use; access invalid data -> disaster
3. someone else will allocate block; overwrite files; data loss
4. storage leak

rename: 4 I/O operations,

4 seeks 40 ms

Performance improvement ideas

1. commit record:: a) write 'data' outputs into a preliminary area (write blocks of file into temp file) 100 blocks
b) wait for all blocks to write into disc (reordering is okay)
c) write to commit record (write "commit")
d) copy temp data back to original, wait to hit disc (for the disc head to be ready)
e) write "done" to commit area.

*use the commit record to control atomicity

General Pattern:

BEGIN(marker): pre-commit phase (write to temp file)
work is being done but can still back out (if system crashes now, nothing happens)
COMMIT or ABORT: post-commit phase
work is being done, can't back out.
END

** user can decide up until COMMIT marker to either keep going or backout.

2. Journal: sequence of instruction listed in temporal order
divides the FileSystem into cellblock and entries in journal refer to the cell index.

	|	|	|	|	|	|	|	|	|	|	|	|	|	|	|unused	area|
	t --->

each block describes a change to the file system Ex) entry: cellblock 2793 newversion: 1 <- 'data' to be written to cellblock.

1 + 2.: combination of commit records and Journal
implement commit record in a Journal:

	|	|	|	|	|	|BEGIN	|WRITE	|COMMIT|COPY	|DONE|	|unused	area|
						^write "begin"	^write "commit" ^write "done"	
	t --->

*gives good write performance but reads are slow

Solution: cache the cell blocks into RAM(commonly used blocks)

or 4.: keep cell blocks and Journal in memory and cache some in RAM

Login protocol:

write head logs
plan writes
later on implement

write behind logs: log old data about to be overwritten
write new data into cell
mark as done

replaying logs: Left to right