CS 111 Spring 2012

Lecture 13: File System Robustness

by Junda Zhu for a lecture by Professor Paul Eggert on May 17, 2012 at UCLA

What can go wrong with file systems?

In this lecture, we will handle the topic of file system robustness. To understand the robustness for file systems, we will first look at some common problems we will encounter.

Design Goal: to build reliable system, response well to bad things and make sure files not getting lost
Obstacles:
- External Events
  - Things that can go wrong from outside the file system
  - Ex: "Bad", or "tricky" user requests
- Internal Events
  - Things that can go wrong inside the file system
  - Ex: Disk failure, Program bug (not user's fault)
- We want hard modularity so that our FS is not affected when either or both these events happen

Back to Contents

Approaches to Building Reliability

[Approach 1] Build reliable components
- Assume each component works the way it should
- Delegated reliability problem to the individual components
  - We don't need to worry about reliability at the highest level
[Approach 2] Assume the components are unreliable and make everything reliable at the current level
Examples:
- Spontaneous errors in disk controller cache
  - Error checking within cache (Approach 1)
    // Using internal checksums to validate data (hardware)
  - Error checking code within the FS (Approach 2)
    // File system can on its own, maintain a checksum for all the disk blocks (software)
- Google's choice: purchased cheap and unreliable servers for low cost, organize thorough and reliable code at high level (Approach 2)

Back to Contents

What are the Bad Events?

External Events
- Problems that occur outside of the PC system
- Examples:
  - Power loss (Focus of this lecture!)
  - Many requests that overwhelm the system (DOS Attack)
  - Remove/updates that are unintended
    // For SEASNET machines, an easy fix for unintended "rm *" is "cd .snapshot"
Internal Events
- Problems that occur inside the PC system
- Examples:
  - Disk error rate overwhelms its internal checksums
  - Disk arm/head crash
  - Bits flip in the controller cache
  - Bits flip in the RAM
    // Bits flip often caused by cosmic radiations; it can be fixed by error-correcting code (ECC).

Back to Contents

Lecture Focus - Loss of Power

(A) Approach 1 & 2

Using Approach 1: Build reliable components
- Example: UPS: "Uninterruptible" Power Supply (backup battery)
  - This is more of a hardware approach
  - UPS Diagram
    
    Image source: http://en.wikipedia.org/wiki/Uninterruptible_power_supply
Using Approach 2: Assume the components are unreliable and make everything reliable at the current level
- Property we want: not lose any data after a power loss
  - "After-Reboot" data is the same as "Just-Before-Crash" data
- "grade.txt" example:
  - Simple file of size 20,000 bytes - "grade.txt"
    // Program always read/write the entire file
```
	read(fd, buf, 20000);       // LOAD operation
	write(fd, buf, 20000);      // SAVE operation
```
  - Load Operation
    // Easy to implement since no data is changing
  - Save Operation
    * Harder to implement
    * inode
    - Contains meta information
    - Last modified date
    - File length
    - Data pages (assume 3 blocks)
    * Could have a power interrupt between writing data blocks
    * Could have a power interrupt before last modified time was modified
    - Need to modify the inode (contains the metadata)

Back to Contents

(B) Consistency & Some Impractical Approaches

Consistency
- We want data and metadata to be consistent
- We want all data blocks to be consistent

Modified Version 1 of "grade.txt" example:



	open("grade.txt", O_NOMETADATA);
          // Fake flag (O_NOMETADATA)
          // Don't open the corresponding metadata
        write(fd, buf, 8192);
          // May not work b/c the power may be interrupted as the disk controller is 
	  // transferring data from the buffer to the disk.
          // Cannot assume writes to FS are atomic!

Modified Version 2 of "grade.txt" example:
- Change application so it has two copies of the file
  - Read from the main copy and write to the tmp copy
  - After the tmp copy is done, modify the tmp copy as necessary
  - Copy the tmp copy over the main copy
  - Problem:
```
	MAIN    A  A  A  ?  B
        TMP     A  ?  B  B  B
        ----------------------
       	Time    1  2  3  4  5

	// The boot procedure doesn't know which copy contains the correct copy (doesn't work)
```

Modified Version 3 of "grade.txt" example:

Change application so it has three copies of the file
* Read from the main copy and write to both tmp copies
* Scenario:



	MAIN    A  A  A  A  A  ?  B
        TMP1    A  ?  B  B  B  B  B
        TMP2    A  A  A  ?  B  B  B
        ----------------------------
        Time    1  2  3  4  5  6  7

	  // This works.
          // If the copies disagree, take the copy with majority agreement
          // If all 3 disagree, pick TMP1 (see time = 4)

Back to Contents

(C) Commit Record

We want something faster than the previous method, commit record does the job.

Commit Record

An extra block that keeps track of which copy of the file is the real copy
Commit record is kept small so that writes are atomic (assumption)
* If commit record bit is 0, look at main
* If commit record bit is 1, look at tmp

Scenario:



	20000  MAIN     A  A  A  A  ?  B  B
        20000  TMP      A  ?  B  B  B  B  B
        512    COMMIT   0  0  0  1  1  1  0
        ------------------------------------
        time            1  2  3  4  5  6  7

          // The commit size is 512 bytes, small enough to be atomic 
	  // The operation at time = 7 can be omitted for efficiency

Pattern for implementing an "All-or-Nothing" operation atop a system in which some operations are atomic
- Scenario:
```
	20000  MAIN     A  A  A  A  ?  B  B
        20000  TMP      A  ?  B  B  B  B  B
        512    COMMIT   0  0  0  1  1  1  0
        ------------------------------------
        time            1  2  3  4  5  6  7
				
	  // The operation at time = 7 can be omitted for efficiency
```
- Pre-Commit Actions:
  * Can back-out of these operations and leave no trace of the high level operations
  * time: 1 ~ 3
- Atomic Action:
  * Atomic at lower level
  * time: 4
- Post-Commit Phase:
  * We cannot change our minds
  * We can improve performance
  * time: 5 ~ 7

Back to Contents

(D) Journaling

Another method that works well.

Taken from double-entry bookkeeping
- Never erase anything from your ledger
Taken from double-entry bookkeeping
- Never erase anything from your ledger
Journal
- Append-only array indicies
- Trying to simulate atomic writes on the cell storage
- Contains our commit records
  - Also contains pre-commit and post-commit data
Cell Stroage
- "Ordinary" part of the FS
Write-Ahead Log
- Log the update in the journal
  - Ex: Intent to change block #27 to contain B
- Write a commit record
- Write the changed blocks to the cell storage
- Clear the commit record by appending a clear record to the journal
Recovery Protocol (Write-Ahead)
- Scan the journal from left-to-right, looking for commits without associated clears
- Apply the commits until the file system is stable
Write-Behind Log
- Log the old value into the journal
  - Store the old value (A) of block #27 into the journal
- Write the new value directly into the block (B)
- Perform a commit onto the journal
Recovery Protocol (Write-Behind)
- Scan backwards (right-to-left) through log looking for uncommitted operations
Advantages (Write-Ahead)
- The journal and cell storage does not need to be on the same disk
  - If one disk contains the journal and the other contains the cell storage, the journaling disk has relatively no seeks
- File system can be cached in RAM if the FS fits into RAM
  - Doesn't require writing to disk
Advantages (Write-Behind)
- Immediately undo the operations
  - But need to copy data and sometimes access from cell storage

Journaling Example:



	rename("d/f", "e/g");
        + get f inode into RAM
        + get d inode into RAM
        + get d data
        + get e inode
        + get e data
        + update d data in RAM
        + update e data in RAM
        + write f inode with link count = 2
        + write e data
        + write d data
        + write f inode with link count = 1

Back to Contents

Golden Rule of Atomicity

Never overwrite your only copy unless your underlying write operation is known to be atomic * If not, either not perform an operation or do an operation in its entirety

Lampson-Sturgis Assumptions

Storage writes may fail and when they fail they may corrupt adjacent blocks
- Incorrect data might be written
Later reads can later detect a badly written sector
- underlying disk hardware has error correction that can detect bad reads
A storage can spontaneously decay causing it to go bad
- Later reads can detect the error
Errors are rare occurrences
Individual sector writes are atomic

Back to Contents

This document is fully HTML 4.01 compliant:

CS 111 Spring 2012

Lecture 13: File System Robustness

by Junda Zhu for a lecture by Professor Paul Eggert on May 17, 2012 at UCLA

Table of Contents

What can go wrong with file systems?

Approaches to Building Reliability

What are the Bad Events?

Lecture Focus - Loss of Power

(A) Approach 1 & 2

(B) Consistency & Some Impractical Approaches

(C) Commit Record

(D) Journaling

Journaling Example:

Golden Rule of Atomicity

Lampson-Sturgis Assumptions