CS 111 Spring 2012
Lecture 13: File System Robustness
by Junda Zhu
for a lecture by Professor Paul Eggert on May 17, 2012 at UCLA
Table of Contents
- What can go wrong with file systems?
- Approaches to Building Reliability
- What are the Bad Events?
- Lecture Focus - Loss of Power
- Approach 1 & 2
- Consistency & Some Impractical Approaches
- Commit Record
- Journaling
- Golden Rule of Atomicity
- Lampson-Sturgis Assumptions
What can go wrong with file systems?
In this lecture, we will handle the topic of file system robustness. To understand the robustness for file systems, we will first look at some common problems we will encounter.
- Design Goal: to build reliable system, response well to bad things and make sure files not getting lost
- Obstacles:
- External Events
- Things that can go wrong from outside the file system
- Ex: "Bad", or "tricky" user requests
- Internal Events
- Things that can go wrong inside the file system
- Ex: Disk failure, Program bug (not user's fault)
- We want hard modularity so that our FS is not affected when either or both these events happen
Back to Contents
Approaches to Building Reliability
- [Approach 1] Build reliable components
- Assume each component works the way it should
- Delegated reliability problem to the individual components
- We don't need to worry about reliability at the highest level
- [Approach 2] Assume the components are unreliable and make everything reliable at the current level
- Examples:
- Spontaneous errors in disk controller cache
- Error checking within cache (Approach 1)
// Using internal checksums to validate data (hardware)
- Error checking code within the FS (Approach 2)
// File system can on its own, maintain a checksum for all the disk blocks (software)
- Google's choice: purchased cheap and unreliable servers for low cost, organize thorough and reliable code at high level (Approach 2)
Back to Contents
What are the Bad Events?
- External Events
- Problems that occur outside of the PC system
- Examples:
- Power loss (Focus of this lecture!)
- Many requests that overwhelm the system (DOS Attack)
- Remove/updates that are unintended
// For SEASNET machines, an easy fix for unintended "rm *" is "cd .snapshot"
- Internal Events
- Problems that occur inside the PC system
- Examples:
- Disk error rate overwhelms its internal checksums
- Disk arm/head crash
- Bits flip in the controller cache
- Bits flip in the RAM
// Bits flip often caused by cosmic radiations; it can be fixed by error-correcting code (ECC).
Back to Contents
Lecture Focus - Loss of Power
(A) Approach 1 & 2
- Using Approach 1: Build reliable components
- Example: UPS: "Uninterruptible" Power Supply (backup battery)
- Using Approach 2: Assume the components are unreliable and make everything reliable at the current level
- Property we want: not lose any data after a power loss
- "After-Reboot" data is the same as "Just-Before-Crash" data
- "grade.txt" example:
- Simple file of size 20,000 bytes - "grade.txt"
// Program always read/write the entire file
read(fd, buf, 20000); // LOAD operation
write(fd, buf, 20000); // SAVE operation
- Load Operation
// Easy to implement since no data is changing
- Save Operation
* Harder to implement
* inode
- Contains meta information
- Last modified date
- File length
- Data pages (assume 3 blocks)
* Could have a power interrupt between writing data blocks
* Could have a power interrupt before last modified time was modified
- Need to modify the inode (contains the metadata)
Back to Contents
(B) Consistency & Some Impractical Approaches
- Consistency
- We want data and metadata to be consistent
- We want all data blocks to be consistent
- Modified Version 1 of "grade.txt" example:
open("grade.txt", O_NOMETADATA);
// Fake flag (O_NOMETADATA)
// Don't open the corresponding metadata
write(fd, buf, 8192);
// May not work b/c the power may be interrupted as the disk controller is
// transferring data from the buffer to the disk.
// Cannot assume writes to FS are atomic!
- Modified Version 2 of "grade.txt" example:
- Change application so it has two copies of the file
- Modified Version 3 of "grade.txt" example:
- Change application so it has three copies of the file
* Read from the main copy and write to both tmp copies
* Scenario:
MAIN A A A A A ? B
TMP1 A ? B B B B B
TMP2 A A A ? B B B
----------------------------
Time 1 2 3 4 5 6 7
// This works.
// If the copies disagree, take the copy with majority agreement
// If all 3 disagree, pick TMP1 (see time = 4)
Back to Contents
(C) Commit Record
We want something faster than the previous method, commit record does the job.
Back to Contents
(D) Journaling
Another method that works well.
- Taken from double-entry bookkeeping
- Never erase anything from your ledger
- Taken from double-entry bookkeeping
- Never erase anything from your ledger
- Journal
- Append-only array indicies
- Trying to simulate atomic writes on the cell storage
- Contains our commit records
- Also contains pre-commit and post-commit data
- Cell Stroage
- "Ordinary" part of the FS
- Write-Ahead Log
- Log the update in the journal
- Ex: Intent to change block #27 to contain B
- Write a commit record
- Write the changed blocks to the cell storage
- Clear the commit record by appending a clear record to the journal
- Recovery Protocol (Write-Ahead)
- Scan the journal from left-to-right, looking for commits without associated clears
- Apply the commits until the file system is stable
- Write-Behind Log
- Log the old value into the journal
- Store the old value (A) of block #27 into the journal
- Write the new value directly into the block (B)
- Perform a commit onto the journal
- Recovery Protocol (Write-Behind)
- Scan backwards (right-to-left) through log looking for uncommitted operations
- Advantages (Write-Ahead)
- The journal and cell storage does not need to be on the same disk
- If one disk contains the journal and the other contains the cell storage, the journaling disk has relatively no seeks
- File system can be cached in RAM if the FS fits into RAM
- Doesn't require writing to disk
- Advantages (Write-Behind)
- Immediately undo the operations
- But need to copy data and sometimes access from cell storage
Journaling Example:
rename("d/f", "e/g");
+ get f inode into RAM
+ get d inode into RAM
+ get d data
+ get e inode
+ get e data
+ update d data in RAM
+ update e data in RAM
+ write f inode with link count = 2
+ write e data
+ write d data
+ write f inode with link count = 1
Back to Contents
Golden Rule of Atomicity
Never overwrite your only copy unless your underlying write operation is known to be atomic
* If not, either not perform an operation or do an operation in its entirety
Lampson-Sturgis Assumptions
- Storage writes may fail and when they fail they may corrupt adjacent blocks
- Incorrect data might be written
- Later reads can later detect a badly written sector
- underlying disk hardware has error correction that can detect bad reads
- A storage can spontaneously decay causing it to go bad
- Later reads can detect the error
- Errors are rare occurrences
- Individual sector writes are atomic
Back to Contents
This document is fully HTML 4.01 compliant: