CS 111: Scribe Notes
File system design (11/04/2011)
Scott Johnson

Hard Links
Short Recap
link("a", "b")- Creates a hard link
- Increases link count of file's inode
- Creates a new directory entries pointing to the same file (i.e. inode)
user$ ls -li
9382313 -rw-r--r-- 2 user staff 9 Nov 8 10:45 a
9382313 -rw-r--r-- 2 user staff 9 Nov 8 10:45 b
(first column: inode number; third column: link count; last column: file name)
unlink("a")
- Opposite effect of
link
- Decrement link count in file's inode
- Remove the directory entry for file
a
- Space is not reclaimed until link count = 0
user$ ls -li
9382313 -rw-r--r-- 1 user staff 9 Nov 8 10:45 b
(first column: inode number; third column: link count; last column: file name)
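A minimal C sketch of this recap, assuming a file a already exists in the current directory (error checking omitted for brevity):

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void) {
    struct stat st;

    link("a", "b");                  /* new directory entry "b" for a's inode */
    stat("a", &st);
    printf("links: %ld\n", (long) st.st_nlink);   /* now 2 */

    unlink("a");                     /* remove entry "a"; the inode survives via "b" */
    stat("b", &st);
    printf("links: %ld\n", (long) st.st_nlink);   /* back to 1 */
    return 0;
}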
User considerations
Can a user create a directory entry pointing to garbage?
It would result in a dangling pointer. (A dangling pointer, or wild pointer, points to an unexpected value/object.)
- No user functions allow for the creation of dangling pointers
  - e.g. open("a", O_CREAT|...) does not allow you to specify which inode number to assign
- There is no system call that allows you to choose (or change) an inode number for a file
What if we had a destroy syscall?
Motivation for destroy
User spock downloaded a video recording of a PBS special on the psychological effects of pon farr, and now wants to delete it because his home computer is almost out of disk space. He executes rm ponfarr.mpeg, but no space has been freed! There must have been another hard link to it... Does he now have to search his computer for all hard links to the file?!
The destroy(inode) system call would:
- Remove all hard links to a file
- Ensure the space used by the file is reclaimed, because link count = 0
Cons: There is no practical way for the OS to know where all the links are, so it would have to either (1) search for the file starting at the root directory, or (2) complicate the directory structure by implementing a way to keep track of all hard links.
Partial solution: ftruncate(fd)
Destroys all blocks in a file (resulting size = 0). Hard links remain, but to a file of size 0. This can be done at the shell level (assume big is a big file):
user$ > big
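The same effect from C, as a minimal sketch (assuming big exists and is writable):

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("big", O_WRONLY);
    if (fd >= 0) {
        ftruncate(fd, 0);   /* free every data block; size becomes 0 */
        close(fd);          /* any remaining hard links now name an empty file */
    }
    return 0;
}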
Creating DAGs and cycles in directories
Don't worry, Unix doesn't allow you to do this! Why doesn't Unix allow you to do this?
1) cd a; pwd
   Should this report that you are in a or c?
2) cat a/d/a/d/a/d/a/d/b
   Ridiculous! But it would work.
3) find . -name b
   Found it in a/b, a/d/a/b, a/d/a/d/a/b, ... That could go on forever.
The great Unix solution: no hard links to directories
Symbolic Links
General Introduction
What are symbolic links (symlinks)?
- Kind of like regular files, except the content is the name of another file
- The OS substitutes the path of the other file for the name of the symlink during name resolution
- Can be created at the shell level with ln -s SRC DST
- Symlinks do not have to point to files or directories that actually exist!
  - i.e. dangling symlinks are OK; the file system is still consistent
- Hard links to symbolic links are OK -- you are referencing the symlink itself, not the file pointed to by the symlink
Take a look at the important directory entries for the above situation after the symbolic link is created (say we want to access sh, which is located in /usr/bin). The bin symlink stores the path /usr/bin. If the user types /bin/sh from the root directory (designated / in Unix), the OS does textual substitution, resulting in the path /usr/bin/sh. Note: symbolic links can be chained, resulting in multiple textual substitutions during name resolution.
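A small C sketch of the stored path, using readlink (which reads the symlink itself rather than following it); it assumes /bin really is a symlink to /usr/bin as in the example above:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[4096];
    ssize_t n = readlink("/bin", buf, sizeof buf - 1);
    if (n >= 0) {
        buf[n] = '\0';                /* readlink does not NUL-terminate */
        printf("/bin -> %s\n", buf);  /* the text spliced in during resolution */
    }
    return 0;
}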
Potential problems with symbolic links?
- Shouldn't they suffer from the same cycle problem as hard links?
  - Yes, so find and other programs have been written to not follow symlinks
- What about infinite-loop symlinks, e.g. ln -s a b; ln -s b a?
  - Unix only allows 20 iterations of symbolic link name resolution, so this "infinite loop" would only be iterated 20 times before returning an error (see the sketch after this list)
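A sketch of the infinite-loop case, assuming a and b do not already exist in the current directory:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    symlink("b", "a");    /* a -> b */
    symlink("a", "b");    /* b -> a */
    if (open("a", O_RDONLY) < 0 && errno == ELOOP)
        printf("open: %s\n", strerror(errno));  /* too many levels of symbolic links */
    unlink("a");
    unlink("b");
    return 0;
}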
Connection back to the GNU tar problem
What was the GNU tar problem? New syscalls/ideas that attack this problem:
- You can open a directory for read access (not write!), allowing you to "put your finger" on the file system, e.g. open("/usr", O_RDONLY, ...)
- openat(fd, filename, O_RDONLY, ...)
  - Uses a directory file descriptor, so even if someone plays the "symlink game," you will still be talking to the same directory (see the sketch after this list)
- unlinkat(fd, filename)
  - Again, fd ensures you are talking to the same directory
- stat resolves symbolic links and returns information about the file the link references; lstat does not resolve symbolic links and returns information about the link itself
- open(..., O_DIRECTORY)
  - Only open the file if it is a directory, i.e. the open will only succeed on a directory
- open(..., O_NOFOLLOW)
  - The open will fail if the file is a symbolic link
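A sketch tying these together; the path /usr and the name bin/sh are just illustrative:

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    /* "Put your finger" on the directory: only succeeds if /usr is a directory. */
    int dirfd = open("/usr", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return 1;

    /* Resolved relative to dirfd, not to whatever "/usr" names by now;
       O_NOFOLLOW makes the open fail if the final component is a symlink. */
    int fd = openat(dirfd, "bin/sh", O_RDONLY | O_NOFOLLOW);
    if (fd >= 0)
        close(fd);
    close(dirfd);
    return 0;
}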
File Name Resolution
- Look at the first byte. If it is /, start at the root directory; else start in the CWD (current working directory)
- Walk through the string left -> right and look for filename components (see the sketch after this list)
- For each filename component, resolve that component by looking for directory entries that match the name of the component
- If one of those entries is a symlink, splice in the contents of the symlink and recurse
  - Resolve the symlink relative to the symlink's directory
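A user-level sketch of this walk; the real lookup happens in the kernel against directory entries, so that step appears here only as a comment:

#include <stdio.h>
#include <string.h>

int main(void) {
    char name[] = "/usr/bin/sh";     /* path to resolve */
    char *p = name;

    if (*p == '/') {                 /* the first byte decides the start */
        printf("start at the root directory\n");
        p++;
    } else {
        printf("start at the CWD\n");
    }

    /* Walk left -> right, one filename component at a time. */
    for (char *comp = strtok(p, "/"); comp; comp = strtok(NULL, "/")) {
        printf("resolve component \"%s\"\n", comp);
        /* Kernel step: find a directory entry matching comp; if the
           entry is a symlink, splice in its contents and recurse,
           resolving relative to the symlink's directory. */
    }
    return 0;
}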
Random notes on chdir and chroot
- Each process has a CWD, stored as an inode number
  - This can be changed with chdir
- The chroot function will treat the specified directory as the root directory
  - "chrooted jail": you can never get back to your old root directory, so make sure your new one is a hospitable place (see the sketch after this list)
  - chroot can only be done by root
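A minimal jail-setup sketch; the directory /jail is hypothetical, and this must run as root:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    if (chroot("/jail") != 0) {   /* fails with EPERM unless run as root */
        perror("chroot");
        return 1;
    }
    chdir("/");                   /* move the CWD inside the new root too */
    /* From here on, "/" means /jail: the old root is unreachable. */
    return 0;
}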
File System Robustness
Goals:
- Good performance: high throughput/low latency
- Durability: if there is a failure in the general underlying system, e.g. hardware, the data survives
- Atomicity: data remains consistent if there is a failure; actions are either done or not done, e.g. write("abc", ...) will never give just "ab"
Robustness pertaining to SAVE in a text editor
SAVE writes the contents of RAM to disk, in general taking several writes. What if we assume a power failure can occur at any time?
- If a power failure occurs while SAVE is overwriting your data, it will result in an inconsistent state
- OK, then write to a copy and don't overwrite the original
Situation: You write to a copy and the power goes out. Which version is correct? The copy or the original?
- If the power went out while saving the copy, the original is correct
- If the power went out while overwriting the original, the copy is correct
- But how do you know which is correct?!
Commit Records
What are commit records?
- Keep a record of all actions performed, e.g. successfully writing to the copy or successfully overwriting the original with the copy
- Assume that writing to the commit record is atomic
If we assume no atomicity in writing to the commit record
We want to write Y to the commit record to indicate that the contents of Y are correct instead of X. Consider a 1-block commit record over time if we assume the write is not atomic: at some instant during the write, the block contains neither X nor Y, so a crash at that moment leaves garbage.
What if we add another block to the commit record? (1) We write Y to commit record 1 while commit record 2 still indicates X; (2) then we write Y to commit record 2. At every point we have either X or Y, but if the two records disagree after a crash, which one do you trust?
Ok, we can actually solve this with 3 records. At any given time, if (record 1) == (record 2), then use that value; else use the value in (record 3).
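A sketch of that voting rule, with records modeled as plain characters ('X', 'Y', and '?' for a torn write):

#include <stdio.h>

/* If the first two records agree, that value is safe; otherwise a
   crash hit while one of them was being written, and record 3 was
   not being written at that moment, so it is intact. */
static char choose(char r1, char r2, char r3) {
    return (r1 == r2) ? r1 : r3;
}

int main(void) {
    /* Crash during the write of record 2: r1 says Y, r2 is torn,
       r3 still says X.  The vote falls back to X, so the commit
       of Y simply never happened. */
    printf("%c\n", choose('Y', '?', 'X'));
    return 0;
}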
So, do we really need triply redundant commit records?
Lampson-Sturgis Assumptions
- Writes can fail
- Storage (blocks) can spontaneously decay
- The read will tell you if a block is bad**
- Errors are rare
- ... there are more but not directly pertinent ...
** If this is true, then we only need doubly redundant commit records, because we can tell when one of the blocks is bad and just use the other.
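A sketch of the doubly redundant version under that assumption; read_record is a hypothetical stand-in for a disk read that reports bad blocks:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in: a real disk read would report a bad block
   through its error result.  Here block 1 has "decayed". */
static bool read_record(int which, char *out) {
    if (which == 1)
        return false;
    *out = 'X';
    return true;
}

int main(void) {
    char v;
    /* If record 1 reads cleanly, trust it; otherwise the
       assumption guarantees record 2 is the good copy. */
    if (!read_record(1, &v))
        read_record(2, &v);
    printf("%c\n", v);
    return 0;
}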
Adventures with rename
rename("a", "b")
- Read directory block
- Modify "a" -> "b"
- Write directory block
rename("d1/a", "d2/c")
- Read d1's blocks into RAM
- Read d2's blocks into RAM
- Modify main memory copy
- Write d1's data
- Write d2's data
- If the system crashes between 4 and 5, the file is gone: disaster
- If we switch steps 4 and 5 and crash in the same spot, you'll see 2 hard links to the same file with link count 1
- Unlinking 1 of these files will cause the link count to go to 0 --> file is deleted + dangling pointer: disaster!
The fix (see the sketch after this list):
- Before step 1: read a's inode into RAM
- Before writing d2's data: write a's inode out with link count ++
- Then write d2's data, then d1's data
- Then write a's inode out with link count --
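A sketch of that ordering, with the disk writes modeled as printouts; the comments note what a crash after each step would leave behind:

#include <stdio.h>

/* Stand-in for one block write to disk. */
static void disk_write(const char *what) { printf("write %s\n", what); }

int main(void) {
    disk_write("a's inode, link count = 2");
    /* Crash here: count too high for one entry -- a storage leak, not a disaster */
    disk_write("d2's data (entry c added)");
    /* Crash here: two entries, and the count of 2 is exactly right */
    disk_write("d1's data (entry a removed)");
    /* Crash here: one entry, count too high -- again just a leak, fixable by fsck */
    disk_write("a's inode, link count = 1");
    return 0;
}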
Correctness Invariants
1) Every block should be used for exactly 1 purpose: data, indirect, bitmap, inode, superblock, boot block
2) If a block is reachable, it should contain initialized data
3) All reachable data blocks are marked as used in the bitmap
4) All unreachable blocks are marked as free in the bitmap

Consequences of violating each invariant, in order:
1) You lose data: DISASTER!
2) You allocate data twice and lose data: DISASTER!
3) You lose data: DISASTER!
4) Actually not so bad: a disk space leak, but the program fsck will try to fix this