By: Victor Wong, Nhi Nguyen

11/18 CS 111 Lecture
File System Robustness

Subgoals:
  • Durability: survives limited hardware failures
    Ex) full disc, lose power, etc.
  • Atomicity: the ability to either make a change completely or not at all
  • Performance: perform well even with unusual requests

What can go wrong with hardware?
  • lose power
  • run out of disc space
  • heavily loaded system
  • sudden disconnection
  • bad sectors/tracks
  • lose disc (disc crash)

Consider the Simplest Case: Losing Power

    Lose contents of:
  • RAM
  • cache (CPU, controller)
  • registers

    Keep contents of:
  • Disc
  • Flash

    One common approach (assume backup power, e.g. a battery):
  • the disc can still be written for a few µs after power is cut
  • the disc controller lifts the disc head to an area with nothing on it
  • the CPU executes a few more instructions

    Want:
  • if a process is in the middle of writing a sector, the write will finish
  • i.e. either power turns off before the sector write starts and it never executes, or power turns off during the write and the write still finishes

    Assume:
  • sector writes are atomic
    Example:
    Zheng et al., FAST '13, "Understanding the Robustness of SSDs under Power Fault"
    tested 15 SSDs under power fault: only 2 survived without errors.
    Problems observed:
  • bit corruption: corrupted sectors
  • flying writes: sectors written to the wrong location (due to load balancing)
  • shorn writes: part of a sector is written but not the whole thing (due to wear leveling)
  • metadata corruption: invalid memory addresses
  • dead device: the drive stops working entirely
  • unserializability: writes were not done in the program's order
    *Golden Rule of Atomicity* - never overwrite your only copy
    One solution: keep two copies on two devices (A, B)

    0 = old version, 1 = new version, ? = write in progress

    A:	0	?	1	1	1
    B:	0	0	0	?	1
    	t --->
    

    Recovery: after a crash, copies A and B may disagree. Which one is correct?
    Rule: if they disagree, choose A.
    (If the crash hit while writing B, A already holds the new version; but if it hit while writing A, A itself may be garbage, so two copies are not quite enough.)
    Problem:
    storage requirements are doubled.
    One fix:
    add a byte per copy recording whether its write succeeded
    - but reading and writing that byte has the same atomicity problem.
    Better solution:
    keep three copies of the data.

    A:	0	?	1	1	1	1	1
    B:	0	0	0	?	1	1	1
    C:	0	0	0	0	0	?	1
    	t --->

    *Take the majority; if all three disagree, choose A. This handles bit corruption and shorn writes.
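    The majority-read rule can be sketched in C; `vote3` is a hypothetical helper name, not from the lecture:

```c
#include <assert.h>
#include <string.h>

/* Return the majority copy among three replicas of n bytes;
   if all three disagree, fall back to A. */
const char *vote3(const char *a, const char *b, const char *c, size_t n) {
    if (memcmp(a, b, n) == 0) return a;   /* A == B: majority says A */
    if (memcmp(a, c, n) == 0) return a;   /* A == C: majority says A */
    if (memcmp(b, c, n) == 0) return b;   /* B == C: majority says B */
    return a;                             /* all three disagree: choose A */
}
```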
    Idealized model: writes are not atomic, but they eventually finish and return success.
    Lampson-Sturgis model:
  • storage writes may fail (and a failed write may corrupt another piece of storage)
  • a later read can detect the bad sector
  • storage may spontaneously decay (e.g. electrical interference causes bit flips across the bus)
  • errors are rare
  • repairs can be done before another failure occurs (if a disc fails, someone can replace it before something else fails)


    Eggert assumptions:
  • block writes are atomic

    That is, assuming a BSD-style file system and a block that is already allocated, an aligned whole-block write is atomic:

    lseek(fd, o, SEEK_SET);   /* where o % 8192 == 0 */
    write(fd, buf, 8192);     /* one full 8192-byte block */

    $ ls -l file
    Nov 18 12:53  <- modification time, stored in the inode
    If the system crashes, the timestamp can be wrong; this matters for tools like version control.
    The timestamp is updated to notify users (and 'make') that the file may have changed.

    Common optimization:
    do the writes now, dally the timestamp updates.
    Problem: this breaks 'make' and other operations that rely on timestamps.
    Solution: run a cleanup procedure after reboot.


    rename("d/a", "d/b"):
    find "d" -> find d's inode
    find d's data block
    change "a" to "b" in the directory entry
    write the block to disc and update
    rename("d/a", "e/b") crosses directories:
    d:	|	|"a"	|	|	|
    e:	|	|	|	|	|
    "a" has inode #573
    a) erase "a" first, then add "b" to e:
       if we crash in between, the file is lost
    b) write "b" into e first, then erase "a":
       if we crash in between, the link count is wrong: two directory entries refer to inode #573, but the inode's link count is only 1.
       This causes problems when deleting files: removing one name frees the inode, leaving a directory entry with an invalid inode #.
    Solution:
    1. increment "a"'s link count (i.e. the link count of inode #573)
    2. write the entry "b" into "e"
    3. erase "a" from "d"
    4. decrement the link count of inode #573
    **With this ordering, a crash can leave storage whose link count never drops to 0 and so is never reclaimed; the link count can be an overestimate of the true value, but never an underestimate.
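    The four-step ordering can be sketched with a toy in-memory model; the struct names and the `safe_rename` helper are illustrative, and each statement stands in for one disc write:

```c
#include <assert.h>
#include <string.h>

struct inode { int num; int links; };
struct dent  { char name[8]; struct inode *ino; };

/* rename("d/a", "e/b"): each statement models one disc write,
   in the crash-safe order from the notes. */
void safe_rename(struct dent *a_in_d, struct dent *b_in_e, const char *newname) {
    struct inode *ip = a_in_d->ino;
    ip->links++;                          /* 1. increment link count of the inode  */
    strcpy(b_in_e->name, newname);        /* 2. write entry "b" into e             */
    b_in_e->ino = ip;
    a_in_d->name[0] = '\0';               /* 3. erase "a" from d                   */
    a_in_d->ino = NULL;
    ip->links--;                          /* 4. decrement the link count           */
    /* A crash between steps 1 and 4 leaves links == 2: an overestimate
       (a possible storage leak), never an underestimate (data loss). */
}
```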
    We need a model of the file system's invariants:
    1. every block of the file system is used for exactly one purpose
       (boot, superblock, bitmap, inode, data)
    2. all referenced blocks are initialized to data appropriate for their type
       (only blocks that contain data being pointed to are in use)
    3. all referenced blocks are marked as used in the bitmap
    4. all unreferenced blocks are marked as free in the bitmap
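    Invariants 3 and 4 are what a repair tool in the style of fsck verifies; a minimal sketch over a toy bitmap (`check_bitmap` and the array layout are assumptions, not lecture code):

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 8

/* Invariant 3: every referenced block is marked used in the bitmap.
   Invariant 4: every unreferenced block is marked free.
   Returns the first violating block number, or -1 if consistent. */
int check_bitmap(const bool referenced[NBLOCKS], const bool bitmap[NBLOCKS]) {
    for (int b = 0; b < NBLOCKS; b++)
        if (referenced[b] != bitmap[b])
            /* referenced-but-free risks overwrite (disaster);
               unreferenced-but-used is only a storage leak */
            return b;
    return -1;
}
```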

    Consequences of violating each invariant:
    1. data loss; disaster (e.g. writing file data into the bitmap block overwrites the bitmap)
    2. garbage inode entries look like they're in use; accessing invalid data -> disaster
    3. someone else will allocate the block and overwrite the file; data loss
    4. storage leak (wasteful, but not a disaster)

    rename: 4 I/O operations,
    4 seeks at ~10 ms each, about 40 ms total
    Performance improvement ideas:
    1. Commit record:
    a) write the 'data' outputs into a preliminary area (write the blocks of the file into a temp file), e.g. 100 blocks
    b) wait for all blocks to reach the disc (reordering among them is okay)
    c) write the commit record (write "commit")
    d) copy the temp data back to the original; wait for it to hit the disc
    e) write "done" to the commit area
    *Use the commit record to control atomicity.
    General pattern:
    BEGIN (marker)
    pre-commit phase (write to temp file)
        work is being done, but we can still back out (if the system crashes now, effectively nothing happened)
    COMMIT or ABORT
    post-commit phase
        work is being done; we can't back out
    END
    **Up until the COMMIT marker, the user can decide either to keep going or to back out.
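    Steps a)-e) can be sketched against POSIX calls; the file paths and the `commit_write` helper are illustrative, and `fsync` stands in for "wait for it to hit disc":

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write buf durably to 'path' via a temp area plus a commit record. */
int commit_write(const char *path, const char *tmp, const char *commitrec,
                 const char *buf, size_t n) {
    int tfd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (tfd < 0) return -1;
    if (write(tfd, buf, n) != (ssize_t)n) { close(tfd); return -1; }
    fsync(tfd);                        /* a)+b) data into temp area, wait for disc */
    close(tfd);

    int cfd = open(commitrec, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (cfd < 0) return -1;
    write(cfd, "commit", 6);           /* c) write the commit record */
    fsync(cfd);

    int dfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (dfd < 0) { close(cfd); return -1; }
    write(dfd, buf, n);                /* d) copy temp data back to the original */
    fsync(dfd);
    close(dfd);

    lseek(cfd, 0, SEEK_SET);
    write(cfd, "done  ", 6);           /* e) mark the commit record done */
    fsync(cfd);
    close(cfd);
    return 0;
}
```

    On recovery, a crash before c) means the original is untouched; a crash after c) means the copy in d) can simply be redone from the temp area.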
    2. Journal:
    a sequence of updates listed in temporal order.
    Divide the file system into cell blocks; entries in the journal refer to cells by index.

    	|	|	|	|	|	|	|	|	|	|	|	|	|	|	|unused	area|
    	t --->

    Each journal entry describes one change to the file system,
    Ex) entry: cell block 2793, new version: 1 <- the 'data' to be written to that cell block.
    1 + 2:
    combine commit records and journaling.
    Implement the commit record inside the journal:

    	|	|	|	|	|	|BEGIN	|WRITE	|COMMIT	|COPY	|DONE	|	|unused	area|
    	t --->

    *This gives good write performance, but reads are slow (finding a cell's current contents means scanning the journal).
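    Reads are slow because a cell's current contents must be found by scanning for the most recent committed write to that cell; a minimal sketch (the `jrec` record layout is an assumption):

```c
#include <assert.h>

struct jrec { int committed; int cell; int val; };

/* Read a cell's current value: scan the journal right to left for the
   newest committed write to that cell; return 'fallback' if none exists. */
int read_cell(const struct jrec *log, int nents, int cell, int fallback) {
    for (int i = nents - 1; i >= 0; i--)
        if (log[i].committed && log[i].cell == cell)
            return log[i].val;
    return fallback;
}
```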

    3. Solution: cache the cell blocks in RAM (the commonly used blocks).

    4. Or: keep the cell blocks only in memory, with the journal as the sole copy on disc; after a crash, reconstruct the cells by replaying the journal.

    Logging protocols:
    Write-ahead logs:
        log the planned writes
        later, install them into the cells
    Write-behind logs:
        log the old data about to be overwritten
        write the new data into the cell
        mark the entry as done
    Replaying logs:
        write-ahead: replay left to right (redo the logged new values)
        write-behind: replay right to left (restore old values for operations that did not finish)
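    Left-to-right replay of a write-ahead log can be sketched as follows; the `logent` layout and the `replay` helper are illustrative:

```c
#include <assert.h>

#define NCELLS 4

struct logent { int committed; int cell; int newval; };

/* Recovery for a write-ahead log: scan left to right and redo every
   committed write; uncommitted entries are ignored. Later entries
   overwrite earlier ones, so order matters. */
void replay(const struct logent *log, int nents, int cells[NCELLS]) {
    for (int i = 0; i < nents; i++)
        if (log[i].committed)
            cells[log[i].cell] = log[i].newval;
}
```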