CS111 Lecture Notes

Lecture 13 – Wednesday, February 27, 2008

By: Brian Hackel and Chris Hui

 

File System Robustness

 

Recall our rename("a", "b") function from last lecture. It just "zaps" a block in place. It is "atomic." (The quotes are because we still have to worry about the write to disk possibly failing, but otherwise it is mostly robust.)

 

Now consider a harder problem:

rename("d1/a", "d2/b");

Here we have to change two things on the disk: the new directory and the old directory.

But what if the power fails in between these two changes?

A schedule for the above rename action could be as follows:

1)   read d2's directory contents into main memory

2)   read d1's directory contents into main memory

3)   clear the "a" entry out of main memory (RAM) in the d1 block

4)   write d1 block to disk

5)   create "b" entry in main memory (RAM) in the d2 block

6)   write d2 block to disk

But what can go wrong with this schedule?

Suppose we crash (power failure) between steps 4 and 5:

Now "a" has already been cleared from d1 on disk, but "b" was never created in d2.

So no directory entry points to the inode at all, even though its link count is still 1.

After reboot the file is unreachable: the data is lost, and the inode and its data blocks are never reclaimed.

 

This leads us to the notion of file system correctness invariants (Boolean expressions that are always true).

 

File System Correctness Invariants:

1)   Each block is used for exactly one purpose: boot block, superblock, bitmap block, inode block, indirect block, data block, directory block.

2)   All referenced blocks are initialized with appropriate data.

3)   All referenced blocks are marked "used" in the bitmap.

4)   All unreferenced blocks are marked ÒfreeÓ in the bitmap.

 

Consequences of violation of the above (punishment for breaking the law):

1)   Disaster – we can lose data and likely crash.

a.    For example, if the same block is used both as a data block and as an indirect block, then after writing file data into it, looking up data blocks through it means following garbage "pointers", and we will likely crash.

2)   Disaster – we can lose data and likely crash.

a.    For example, we could end up writing over some data that is already on the disk.

3)   Disaster – we can have two files writing into the same data block because it was allocated twice (once for each file, because its bit was never switched to "used" in the bitmap).

4)   Not a disaster, but an inefficiency – a block can go permanently unused because the bitmap always shows it as "used", even though it does not really represent a used block, and the bit will never be switched back to "free".

a.    Thus, if we HAVE to give up one of the invariants, this (#4) would be it, and most file systems do give it up. After a crash we might lose track of blocks, and they never get used again.

b.    In Unix: file system check program (fsck)

                                                       i.     On reboot, it walks through the file system looking for unused blocks and repairs violations of #4, if any are found, by switching the appropriate bitmap bits back to "free".
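The repair fsck performs here can be sketched as a toy Python fragment (the real fsck does far more; the bitmap and reference set below are invented for illustration):

```python
# Toy sketch of the fsck repair for invariant #4 (not the real fsck):
# any block marked "used" in the bitmap that no file references gets
# its bit switched back to "free".

def fsck_free_leaked_blocks(bitmap, referenced):
    """bitmap: dict block -> 'used' or 'free'; referenced: set of blocks
    found by walking the file system. Returns the repaired blocks."""
    repaired = []
    for block, state in bitmap.items():
        if state == 'used' and block not in referenced:
            bitmap[block] = 'free'   # repair the invariant-#4 violation
            repaired.append(block)
    return repaired

# After a crash, block 7 was allocated but the reference to it was lost.
bitmap = {5: 'used', 6: 'used', 7: 'used', 8: 'free'}
referenced = {5, 6}
print(fsck_free_leaked_blocks(bitmap, referenced))  # [7]
print(bitmap[7])                                    # free
```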

 

Now we can revisit our schedule for rename() to make it satisfy the invariants:

Move steps 5 and 6 up to sit in between 2 and 3.

1)   read d2's directory contents into main memory

2)   read d1's directory contents into main memory

3)   create "b" entry in main memory (RAM) in the d2 block

4)   write d2 block to disk

5)   clear the "a" entry out of main memory (RAM) in the d1 block

6)   write d1 block to disk

But now consider a crash after new step 4 (writing d2's block to disk): we end up with two hard links to the file while the inode's link count is only 1. If we then remove one of the names after reboot, the link count drops to zero and the inode and its data blocks are reclaimed, even though the other name still points to them. If someone later creates a new file, it can be given that "free" inode, and the surviving name now points at the wrong file: we lose data.

 

To fix, we add new steps:

1)   read d2's directory contents into main memory

2)   read d1's directory contents into main memory

2.1) read inode of file "a" into RAM

2.2) increment the link count of the inode in RAM

2.3) write inode back to disk

3)   create "b" entry in main memory (RAM) in the d2 block

4)   write d2 block to disk

5)   clear the "a" entry out of main memory (RAM) in the d1 block

6)   write d1 block to disk

6.1) decrement link count of inode in RAM

6.2) write inode back to disk
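The ordering argument can be checked mechanically. Below is a toy simulation (all structures invented for illustration) that replays the four disk writes of this schedule and asserts, after every write, the safety property the extra steps buy us: the on-disk link count is never smaller than the number of on-disk directory entries, so an unlink after a crash can never reclaim a still-referenced inode.

```python
# Toy simulation of the final rename schedule's four disk writes.
# On-disk state: the set of directory entries naming the inode, and
# the inode's link count.

def rename_writes():
    """Yield the on-disk state after each of the four writes:
    2.3) inode link count -> 2
    4)   create "b" in d2
    6)   clear "a" from d1
    6.2) inode link count -> 1"""
    entries = {'d1/a'}      # on-disk names for the inode before rename
    links = 2               # write 2.3: bump the link count first
    yield set(entries), links
    entries.add('d2/b')     # write 4: new name appears
    yield set(entries), links
    entries.remove('d1/a')  # write 6: old name disappears
    yield set(entries), links
    links = 1               # write 6.2: drop the count back to 1
    yield set(entries), links

# At every possible crash point, link count >= number of names on disk,
# so an unlink can never free a live inode.
for entries, links in rename_writes():
    assert links >= len(entries), (entries, links)
print("safe at every crash point")
```

Note that with the previous schedule (no steps 2.1-2.3 and 6.1-6.2), the state after write 4 would be two entries with a link count of 1, and the assertion would fail: exactly the bug described above.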

 

But now we have 4 writes to disk instead of 2. Doesn't this cause MORE of a problem?

Yes, it is a lot slower, but consider what we've gained:

 

Consider the possible crash points in the new scheme: a crash between any pair of writes.

- If we crash after 2.3 and delete the file on reboot, we don't lose data, but invariant #4 will be violated: the file's blocks are no longer referenced but stay marked "used", so they can never be used again.

- If we crash after 4, we have two hard links to the same file and have basically done the equivalent of "ln a b" rather than our intended "mv a b". It won't cause a crash. It's not GOOD, but it won't cause a disaster either.

- If we crash after 6, it is the same as the crash after 2.3: one hard link, but the link count is 2.

- If we crash after 6.2, we're ok because our rename() operation has completed.

 

File System Robustness

Terminology (conforming to IEEE standard):

error: a mistake made by the designer (in the designer's head)

fault: defect in hardware or software that may cause a problem

failure: the system behaves incorrectly (perhaps from an earthquake or a stray gamma ray)

File System Goals – to be robust in all three of the above cases (or at least to survive the most probable failures)

Ideally, we strive to make system failure rate lower than the failure rates of its constituent parts.

Durability: data should survive (limited) hardware failures

Atomicity: changes are either made or not; all or nothing

- our rename() function schedule fails this because a "rename" can turn into a "link"

Performance: high throughput, low latency (these are often competing goals though)

 

Suppose an application wants to do a lot of writes (all kinds of writes, just data)

The Operating System will batch these writes into a buffer such that the application thinks the writes are done, but they have not hit the disk yet.

Usually they are batched in disk order to minimize seeks, but this can destroy the robustness we worked so hard for in our rename() example!

 

Does this mean we cannot reorder writes at all? Can we have no optimization writing to disk? Or can we do some sort of reordering and still maintain our robustness?

Suggestions:

A) Reorder only sequences of writes to data blocks

Simple Example: write(fd, buf, 16384)   //write 2 blocks

If we crash such that the second block wrote but the first did not (batched writes to data blocks), is this ok? It depends on our correctness criteria. But we have already defined our four invariants, and with them, this strategy works out!
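A tiny sketch of why this works out (toy disk, invented names): reordering writes that touch only data blocks leaves all metadata alone, so invariants 1-4 hold for any prefix that reaches disk; the worst case is a file whose contents mix old and new blocks.

```python
# Two data-block writes from write(fd, buf, 16384), applied in either
# order. Metadata (bitmap, inodes) is untouched either way, so the four
# invariants hold no matter which prefix survives a crash; only the
# file's *contents* can be a mix of old and new.

disk = {'block0': 'old0', 'block1': 'old1'}   # invented toy disk
writes = [('block0', 'new0'), ('block1', 'new1')]

def apply_writes(disk, writes):
    d = dict(disk)                 # copy so each scenario starts fresh
    for block, data in writes:
        d[block] = data
    return d

in_order  = apply_writes(disk, writes)
reordered = apply_writes(disk, list(reversed(writes)))
assert in_order == reordered       # full completion: same result either way

# Crash after only the *second* (reordered) write reached disk:
crashed = apply_writes(disk, [('block1', 'new1')])
print(crashed)   # {'block0': 'old0', 'block1': 'new1'} -- mixed contents, no invariant broken
```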

B) Is it ok to reorder bitmap block writes?

ThatÕs a great question for the FINAL EXAM! (think about it)

More general approach:

The O.S. keeps track, for each written block, of which reads/writes it depends on (keeping a dependency graph for writes). The disk scheduler will then check and respect these dependencies.

This is the FreeBSD ffs approach and is used in Mac OS X, but it took a lot of work to get it there.
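The scheduler side of this idea can be sketched with a topological sort over the dependency graph (a toy sketch, not the real FreeBSD code; Python's graphlib stands in for the kernel's dependency tracking). The four writes below are the ones from our rename() schedule:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Toy version of the dependency-tracking idea: each pending write
# records which other writes must reach disk first, and the scheduler
# only issues a write once its dependencies are done -- within that
# constraint it is free to reorder (e.g. by disk address).

# deps[w] = set of writes that must hit disk before w.
deps = {
    'write_inode(link=2)': set(),
    'write_d2_block':      {'write_inode(link=2)'},
    'write_d1_block':      {'write_d2_block'},
    'write_inode(link=1)': {'write_d1_block'},
}

order = list(TopologicalSorter(deps).static_order())
print(order)
```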


Ideas for making the file system more robust:

1)   Be careful about reordering writes

2)   Journaling: This is an idea taken from double-entry bookkeeping in accounting. Whenever a transaction is made, it is recorded in two places, not just one. The records are permanent; once it is written down it cannot be changed or erased. "Use ink, never erase." More about this idea later.

3)   Duplicate writes/ RAID (Redundant Array of Inexpensive Disks): Use multiple hard drives to improve performance and robustness.

 

Ideas 1 and 2 protect against power failures. Idea 3 protects against disk failures (a fried drive).

 

If we go with the journaling approach, our system will be made of 2 parts:

 

Journal: The journal is stored on a separate file system (it might even be on a different disk). It is append-only, and entries can only be accessed sequentially (no seeks). Before we execute a write (actually changing the cell storage), we first have to make 2 entries in the journal:

1) The first entry tells the system that a write is taking place and what it will do.

2) The second entry tells the system that we are committed to performing the write.

 

Cell storage: The normal file system that we expect. Contains data blocks, indirect blocks, inodes, bitmap blocks, etc.

If our system crashes, fsck is run to check/repair the file system. If fsck sees a write that we were committed to but never finished, it will update the cell data as if the write had been carried out.

 

Assumptions/requirements:

1)   Replaying the journal must be idempotent (same result when it is done multiple times)

2)   WhatÕs in the log:

BEGIN

CHANGE - planned changes

CHANGE

...

OUTCOME - either a commit or an abort

END - all the changes are accomplished in cell storage
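Assumption #1 deserves a tiny demonstration (invented toy "disk"): a CHANGE record that says "set block B to value V" can be replayed any number of times, while a record that says "add a delta to block B" cannot.

```python
# Why journal replay must be idempotent: fsck may itself crash during
# recovery and replay the same records again. "Set block to value" is
# safe to repeat; "add delta to block" is not.

def replay_set(disk, records):
    for block, value in records:
        disk[block] = value          # idempotent: repeating changes nothing

def replay_increment(disk, records):
    for block, delta in records:
        disk[block] += delta         # NOT idempotent

disk_a = {'b1': 0}
replay_set(disk_a, [('b1', 42)])
replay_set(disk_a, [('b1', 42)])     # replayed twice: same result
print(disk_a['b1'])                  # 42

disk_b = {'b1': 0}
replay_increment(disk_b, [('b1', 42)])
replay_increment(disk_b, [('b1', 42)])
print(disk_b['b1'])                  # 84 -- the double replay corrupted the block
```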

 

Cache in the controller: It would be bad if writes were cached in the disk controller and then the system crashed before they reached the disk. Possible solutions:

A) Don't use the cache for writes

B) Use a nonvolatile cache (NVRAM): Memory that lasts even when the power is off

C) Use a capacitor to act as an emergency battery/UPS (Ugly Power Supply)

D) A smart controller that knows about read/write ordering

 

Idea: Keep cell storage in RAM instead of disk. Since this takes up a lot of memory, we should only cache the parts we need (such as recent files)

         -Pros: Writes will be faster.

-Cons: Reads will be slower. We will need a garbage collector, since the journal will run out of space, and the collector has to be incremental (garbage collection must take place while normal programs are running).

 

To implement a journal we can either use a write ahead log or a write behind log.

 

-Write ahead (Roll forward): The idea is that when the system crashes, we can still perform the writes we planned to do but weren't able to. This is the idea that we have discussed so far.

1) Log all planned actions for a transaction

2) Commit

3) Install new values into database

 

If we run fsck with write ahead, the system will scan backwards through the journal looking for writes that we committed to, but were not carried out. Then it will do a second scan, going forward and applying the new values to the cell data.
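That two-pass recovery can be sketched as follows (the record layout is invented for illustration; a real journal stores physical block images):

```python
# Toy write-ahead recovery. Pass 1: scan the log backwards, collecting
# transactions that logged OUTCOME=commit but never logged END
# (committed but not fully installed). Pass 2: scan forwards,
# re-applying those transactions' CHANGE records to cell storage.

def recover(log, cell):
    committed, ended = set(), set()
    for rec in reversed(log):                  # pass 1: backwards
        if rec[0] == 'END':
            ended.add(rec[1])
        elif rec[0] == 'OUTCOME' and rec[2] == 'commit':
            committed.add(rec[1])
    replay = committed - ended
    for rec in log:                            # pass 2: forwards
        if rec[0] == 'CHANGE' and rec[1] in replay:
            _, _, block, value = rec
            cell[block] = value                # idempotent install
    return replay

log = [
    ('BEGIN',   't1'),
    ('CHANGE',  't1', 'blk3', 'new3'),
    ('OUTCOME', 't1', 'commit'),
    # crash happened here: no ('END', 't1') record
]
cell = {'blk3': 'old3'}
print(recover(log, cell), cell)   # {'t1'} {'blk3': 'new3'}
```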

 

-Write behind (Roll back): The idea is that when the system crashes, we restore the system back to its old state. We assume all cell data is persistent.

1) Log all old values

2) Install new values

3) Commit

 

If we run fsck with write behind, the system will scan backwards through the journal looking for data that was installed but not committed. We undo these changes by installing the old data back. Note that this requires only 1 pass.
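This single backwards pass can be sketched as follows (again, the record layout is invented for illustration):

```python
# Toy write-behind (roll back) recovery: one backwards scan that
# restores logged OLD values for transactions that installed new data
# but never committed. Scanning backwards, a transaction's commit
# record (if any) is seen before its OLD records.

def rollback(log, cell):
    committed = set()
    for rec in reversed(log):          # the single backwards pass
        if rec[0] == 'OUTCOME' and rec[2] == 'commit':
            committed.add(rec[1])
        elif rec[0] == 'OLD' and rec[1] not in committed:
            _, _, block, old_value = rec
            cell[block] = old_value    # undo the uncommitted install

log = [
    ('BEGIN', 't1'),
    ('OLD',   't1', 'blk3', 'old3'),   # logged before installing 'new3'
    # 'new3' was installed in cell storage, then we crashed before
    # logging ('OUTCOME', 't1', 'commit')
]
cell = {'blk3': 'new3'}
rollback(log, cell)
print(cell)   # {'blk3': 'old3'}
```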

 

These ideas of write-ahead and write-behind can be mixed together.

 

RAID (Redundant Array of Inexpensive Disks) - This is the idea of using multiple hard drives to increase performance.

 

RAID 0: 2 hard drives connected to each other. Data is striped, so that the data blocks alternate between the 2 disks.

Pros: Increased capacity, parallel I/O for faster performance

Cons: Less robust. If a single disk fails, the whole array is lost, since every file's data is spread across both disks.

Disk 1: A1  A3  A5  A7

Disk 2: A2  A4  A6  A8
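The striping layout corresponds to a simple address mapping (a sketch; real arrays use configurable stripe sizes rather than a single block):

```python
# Block-address mapping for 2-disk RAID 0 striping with a one-block
# stripe unit: logical block i lives on disk i % 2 at offset i // 2.

def raid0_map(logical_block, ndisks=2):
    return logical_block % ndisks, logical_block // ndisks

# Reproduce the A1..A8 layout above (A1 is logical block 0):
for i in range(8):
    disk, off = raid0_map(i)
    print(f"A{i + 1} -> disk {disk}, offset {off}")
```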

 

RAID 1: Mirroring: 2 hard drives connected to each other. Both drives contain the same data.

Pros: More robust. We have additional protection against crashes since the other disk acts as a backup. Increased read speed.

Cons: Slower write speed (we have to write to twice as many disks)

Disk 1: A1  A2  A3  A4

Disk 2: A1  A2  A3  A4