File System Goals


Robustness: System will still work even if unusual data patterns are input
Can decompose into subgoals:

What can go wrong with hardware?


Solutions:

See "Understanding the Robustness of SSDs under Power Fault" by Zheng et al., FAST '13

Summary: Even though SSDs can do more operations than disks, they have more problems after a power failure, because their high performance relies on parallel writes, wear-levelling, etc.

Loss of Power Model


Suppose we're trying to implement ^X^S in Emacs
write(fd, buf, sectorsize)
Assume no OS for now (no buffering, no dallying, etc.). If we lose power mid-write, we can lose data.

Assuming we have two devices, we must follow the...
GOLDEN RULE OF ATOMICITY
Never overwrite your only copy

Attempt 1: Have TWO copies of data
A old
B new
write(A)
write(B)
0: old version, 1: new version, ?: write in progress (garbage)

     1  2  3  4  5
  A  0  ?  1  1  1
  B  0  0  0  ?  1

Method: if they disagree, choose A
Problems: doubles the storage, and STILL does not work: in column 2, A is undetermined, so choosing A will give a garbage value

Attempt 2: Have THREE copies of data
     1  2  3  4  5  6  7
  A  0  ?  1  1  1  1  1
  B  0  0  0  ?  1  1  1
  C  0  0  0  0  0  ?  1

Method: choose the majority of the three; when there is no clear majority (column 4), choose A, which tends to be written sooner
This solves the atomicity problem, but not practically:
Problem: tripling the data makes this solution too expensive for real-life purposes, and it is rarely implemented.

This exercise is mainly useful for teaching us how to model/formulate such problems
Idealized model: writes are not atomic and they eventually finish

Lampson-Sturgis Assumptions

  1. Writes to storage may fail, or may corrupt adjacent storage; a later read can detect the bad sector
  2. Storage can spontaneously decay; a later read can detect the bad sector
  3. Errors are rare; repairs can be done before the next failure occurs

Eggert's Assumptions (overly simple and optimistic)

assert(lseek(fd, 0, SEEK_SET) % 8192 == 0);
write(fd, buf, 8192);  // assume the block is already allocated
Assume a BSD-style file system: superblock->bitmap->inodes->data
Thus the data is stored in the data blocks, and the timestamp is stored in the inodes
This could cause problems for time-sensitive programs (e.g. source control) if a power failure occurs when only one has written
now we do $ ls -l file

Option 1: If we write a timestamp before we write data, the data could be wrong (a null change)
Option 2: If we write data before we write a timestamp, the timestamp could be wrong (a hidden change)

Option 1 is preferable: we should err on the side of recording actions we may not have taken, rather than taking actions that are not logged anywhere. But moving the disk arm between the timestamp (in the inode) and the data blocks takes a lot of time

Common optimization: do the data writes now, and dally the timestamp writes (batch them and write them later)
Problem: Breaks 'make' or other programs that depend on timestamp accuracy
Solution: Cleanup procedure after reboot

rename("d/a", "d/b")
  find d's inode
  find d's data block; change the entry "a" to "b" (a single write)
rename("d/a", "d/bb")
  the longer name may not fit in the existing directory entry, so a new entry (possibly a new block) may have to be allocated
rename("d/a", "e/b")
  must write both d's and e's data blocks
If we write e's data before updating the link count and power is lost, we will have the correct data but the link count will still be 1.
Then when some unlink decrements it, the inode will be freed even though a directory entry still points at its data.

Thus write/handle the link count in this order:
  1. increment the link count
  2. write e's data
  3. write d's data
  4. decrement the link count
The link count never underestimates the true value, but can temporarily overestimate it.

Models of Underlying Systems


Invariants for our system

  1. Every block of the file system is used for exactly one purpose
  2. All referenced blocks are initialized to data appropriate for their type
  3. All referenced blocks are marked used in bitmap
  4. All unreferenced blocks are marked free in bitmap

Consequences of violating rules

  1. Disaster: If blocks have 2 purposes, you could think a block is free when it is not -> data loss
  2. Disaster
  3. Disaster
  4. Storage leak

Number 4 has the mildest consequence: it costs only storage, so it may be violated temporarily during certain system calls; it is not really a hard invariant

Performance Improvement Ideas


  1. Commit Record
    • Write "data" outputs into a preliminary area, e.g. a temp file
    • Wait for all blocks to hit the disk (reordering is OK)
    • Write a commit record saying done
    • Copy temporary data back to original
    • Write "done" to a commit record
    This follows a general pattern (appears to be atomic at a higher level):
    • BEGIN
      precommit phase: work is being done, but you can still back out
    • COMMIT (atomic at the lower level)
      postcommit phase: work that can no longer be backed out, usually deferred past the commit for performance reasons
    • END
  2. Divide disk into 2 parts:
    Cell storage: the actual data blocks, updated in place
    Journal: a sequence of change records listed in temporal order
    Each change: grab unused space, record a description of the change, then do the change
    Circularly linked: when the journal reaches the end, wrap to the beginning (entries at the beginning should be obsolete by that point)
    Optimized for writes
    Reads are slow, but a commonly used subset could be cached in RAM
    Logging Protocols:
    • Write ahead log:
      • log planned writes
      • implement them
    • Write behind logs:
      • log old data about to be rewritten
      • write new data into cell
      • mark as done
    • When replaying logs after a crash
      • a write-ahead log is replayed left to right (redo: install the new values)
      • a write-behind log is replayed right to left (undo: restore the old values)
