LECTURE 13 NOTES: WEEK 8 WEDNESDAY

Scribe Notes by Joshua Adler

File System Robustness

Today we are looking at the BSD "fast" file system.
The BSD is comprised of:

boot block
super block: info about file system
bitmap: info about which blocks are free or not
inode table: array of inodes, 1 per file
data blocks

How can we find a bug in the file system code?
USE INVARIANTS

Invariants are boolean expressions that should always be true.
For file systems these include:

Each block in file system is used for exactly 1 purpose
If we change the invariant to less than or equal to 1 purpose the file system will still work, but might waste disk space
All referenced blocks are initialized to data appropriate for that block (inodes, directories etc.)
All referenced data blocks are marked as used in the bitmap
All unreferenced data blocks are marked free in the bitmap

Consequences of violating each invariant above:

Disaster (kernel/apps might crash, data overwritten, bitmap overwritten...)
Disaster - Looking at random data (OS may crash)
Disaster - overwrites data blocks (lost data, overwrite directories...)
Wasted space - block leak (permanenty unused block)

Performance Issues

Say the following actions occur

p1 ... write(fd, "hello", 5);
0.0002 seconds later
p2 ... read(fd, buf, 5);

If the write goes directly onto disk then the read will take a long time
Instead the kernel will look in cache to improve performance

BATCHING - user writes a byte at a time, leave in cache, then writes full block at once
DALLYING - don't write data for a little bit and leave in cache

SOME PROBLEMS

Since RAM is volatile, if we lose power write is lost
Even worse, a write success might have been visible to the user

Problem 1: might write out of order (due to circular elevator algorithm etc.)
Problem 2: Side effects might not be saved due to out of order writes

Ways to fix
Have app tell us when write hits the disk

Complicate the API (int fsync(int fd))
Returns integer for all the cached data for the file that is now written, returns -1 on bad disk

void sync(void)
Takes everything that is cached anywhere and schedules a write for it

When updating data you must change block of data + block with inode (at least inode mtime must be updated)
for this there is another system call: int fdatasync(int) which doesn't change the inode

Another example of a problem is rename("a" , "b");

Find working directory's inode in RAM
Get directories data into RAM - find "a" with correct inode and change "a" to "b"
Eventually cache will be written to disk

Now rename ("x/a" , "y/b") - rename from 1 directory to another

x's directory entries will have "a" linked to an inode
y's directory entries will be empty

Steps to do the rename will occur as follows:

Read inode
Read x's data
Read y's data
Write inode (linkcount = 2)
Write y's new data
Write x's new data
Write inode again (linkcount = 1)

This doesn't preserve all of our invariants because linkcount might be 2 instead of 1 after crash, but we only lose space
To combat this there is a command 'fsck' which looks at low level for incorrect link counts
This command runs on unmounted file systems only, and is often run after reboot

Interaction between Scheduling and Robustness

Say there is a set of pending requests
read block #9312
write block #497132

There are multiple scheduling algorithms that we can use such as FIFO or SSTF (Shortest Seek Time First)
FIFO will always work (might be slow) but SSTF might never reach a distant request in some cases

BSD Filesystem

We want a compromise that will be fast scheduling and avert disasters

data - ok to reorder data blocks (assume no fsync, fdatasync)
metadata - remember dependencies in nondata blocks
Tell which ones must go before others
Low level system must respect dependencies - but can still do reordering

Robustness Terminology/Theory

3 main goals of filesystems

DURABILITY - keep working even if there is a failure in underlying hardware (losing power)
ATOMICITY - changes are either all made or not made at all when failure occurs (rename "a", "b")
PERFORMANCE - throughput (operations per sec) + latency (how long for a request to be done)

How to implement atomicity on a device that isn't atomic itself

Atomic write - void awrite(char x) {write(x);}
In general we would: lock, then write, then unlock
This doesn't work for powerloss though

Golden Rule of Atomicity - Never overwrite last copy of data

For this example we will be renaming a file from A to B
One method is to have 3 copies of the data
If there is a powerloss choose the majority, if all different choose data I

I	A	?	B	B	B	B	B
II	A	A	A	?	B	B	B
III	A	A	A	A	A	?	B

This method works during powerloss at any point but it uses 3x disk space

Lampson-Sturgis Assumptions

Storage writes may fail, or may corrupt other blocks
A later read can detect the bad block
Storage con spontaneously decay (2 still applies)
Errors are rare
Repairs can be done in time for errors

Idea #1 Commit Record

Have another copy of the data blocks where we write the new version
Create a commit record that keeps track of what the correct set of data is (0 = old block, 1 = new block)

Idea #2 Journal

Make atomic changes to the cell data
The journal will be updated to have a log of all changes made and all planned writes