Scribe Notes by: TaeSung Roh and Michael Liao

File System Robustness / Reliability

Terminology

we need to attack all 3!

File System Goals

Recall: transactions in main memory

Hard Links and Trouble

inode has a link count along with other stuff (data, block ptrs)
we have stuff like usr/bin/bash is really a link to bin/sh
we can do something like link /bin/sh","home/eggert/bin/s")

what's happening is that these links point at an inode, which has link count 3

the directory which we linked also has its own inode -> inside its data blocks we have a directory entry which points it at /bin/sh

rename("/home/eggert/bin/s", "/home/eggert/bin/shell")
when we do this, we need to mess with the data structure in the directory entry
what happens is we write the new disk block in RAM, when we start to write it to disk, we pull the plug! now we have a corrupted directory

rename("/home/eggert/bin/s","/home/eggert/test/sh");
we do two operations here: read bin dir into RAM, read test dir into RAM, clear "s" from bin in RAM, write bin block out, add "s" to test in RAM, write test block out

must assume that the system might crash between any two steps just named
we are ok if power failure occurs, unless we have a power failure between the write to block

what we do is we put write bin block out as the last step
if there is power failure now, between write test block out and write bin block out, we essentially do a link instead of a rename
the problem here is that the link count is never incremented
why it's bad: later, we can have a link count of 0, but still a link to the inode

we could also take the approach where we increment the link count on the test block out, subtract on the bin block out, but this doesn't work either
we’ll come back later

Another Problem: Hard links to directories

what happens if we link bin to usr/bin? the bin diretory has a link count of 3 (usr points to bin, bin points to usr/bin, . points to itself) Sub directory linking to parent directory: this causes a cycle
Hard links to directories are prohibited

How do we do this then? Symbolic link

Symbolic Links

Symbolic link has an inode with a data block, stuff in the in data block, like normal stuff, however in the type field, there is something that denotes a symlink
execvp("/bin/sh") - if "/bin/sh" is a symlink to usr/bin, also works recursively (if in turn usr/bin is a symlink, it links to that too)
symlinks don't count in the link count, so we can create a cycle this way

Pros:

Cons:

File System Invariants

Invariant: Boolean expression that’s always true
Let’s assume typical Unix f.s.

  1. every block of the f.s. is used for exactly 1 purpose (*: doesn’t matter)
    boot, super block, bitmap*, inode, data* (regular files - but for directories, it matters)
  2. all referenced blocks are initialized to valid data
  3. all referenced blocks in the data are marked used in bitmap
  4. all unreferenced blocks in the data are marked unused in bitmap

Consequences of Violations:

  1. Disaster
  2. Disaster
  3. Disaster
  4. Leak: wasted space

Example

rename(“d/a”, “e/b”)
read d
read e
read a’s inode
read b’s inode
+ 1 a’s link count in RAM
write a’s inode
write updated e
write updated d
- 1 a’s link count
Write a’s inode
(We violated 4 but not 1~3)


This requires 5 writes and a move! There “should” be only 2!

Interaction Between Performance & Robustness

Elevator algorithm + rename op.

Ops are not done in order requested (for performance reasons)
Fix: (write read) requests list dependencies (cheap to implement)
I/O to data blocks have no dependencies

ext4

Huge backwards compatibility issue

ext1|ext2|ext3|ext4

ext4: new data structures on disk

Some new stuff:

8 byte

+ 2 bits to time stamps (@ high-end)
+ 30 bits @ low end (ns resolution)
(ext3: 32-bit time_t value (1970-2038)

Extents

improve performance (128 MB each)

Containing 4KiB blocks

allocating contiguously
this is not backward compatible at all
if your file is file is <= 512 MB, you can be represented by 4 extents, if not, we use an indirect approach (similar to what we've seen before)

new system call "fallocate" - advises file system about expected size of file

delayed allocation

don’t extend on write(), extend on flush, improve performance

Valid HTML 4.01 Strict