CS 111: Scribe Notes
File system design (11/04/2011)
Scott Johnson

Hard Links
Short Recap
link("a", "b")- Creates a hard link
- Increases link count of file's inode
- Creates a new directory entries pointing to the same file (i.e. inode)
user$ ls -li
9382313 -rw-r--r-- 2 user staff 9 Nov 8 10:45 a
9382313 -rw-r--r-- 2 user staff 9 Nov 8 10:45 b
(first column: inode number; third column: link count; last column: file name)
unlink("a")
- Opposite effect of
link
- Decrement link count in file's inode
- Remove the directory entry for file
a
- Space is not reclaimed until link count = 0
user$ ls -li
9382313 -rw-r--r-- 1 user staff 9 Nov 8 10:45 b
(first column: inode number; third column: link count; last column: file name)
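A minimal C sketch of this recap, assuming a file a already exists in the current directory (error checking omitted for brevity):

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void) {
    struct stat st;

    link("a", "b");                  /* new directory entry "b" for a's inode */
    stat("a", &st);
    printf("links: %ld\n", (long) st.st_nlink);   /* now 2 */

    unlink("a");                     /* remove entry "a"; the inode survives via "b" */
    stat("b", &st);
    printf("links: %ld\n", (long) st.st_nlink);   /* back to 1 */
    return 0;
}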
User considerations
Can a user create a directory entry pointing to garbage?
It would result in a dangling pointer. (A dangling pointer, or wild pointer, points to an unexpected value/object.)
- No user functions allow for the creation of dangling pointers
  - e.g. open("a", O_CREAT|...) does not allow you to specify which inode number to assign
- There is no system call that allows you to choose (or change) an inode number for a file
What if we had a destroy syscall?
Motivation for destroy
User spock downloaded a video recording of a PBS special on the psychological effects of pon farr, and now wants to delete it because his home computer is almost out of disk space. He executes rm ponfarr.mpeg, but no space has been freed! There must have been another hard link to it... Does he now have to search his computer for all hard links to the file?!
The destroy(inode) system call would:
- Remove all hard links to a file
- Ensure the space used by the file is reclaimed, because link count = 0
Cons: There is no practical way for the OS to know where all the links are, so it would have to either (1) search for the file starting at the root directory, or (2) complicate the directory structure by implementing a way to keep track of all hard links.
Partial solution: ftruncate(fd)
Destroys all blocks in a file (resulting size = 0). Hard links remain, but to a file of size 0. This can be done at the shell level (assume big is a big file):
user$ > big
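The same effect from C, as a minimal sketch (assuming big exists and is writable):

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("big", O_WRONLY);
    if (fd >= 0) {
        ftruncate(fd, 0);   /* free every data block; size becomes 0 */
        close(fd);          /* any remaining hard links now name an empty file */
    }
    return 0;
}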
Creating DAGs and cycles in directories
Don't worry, Unix doesn't allow you to do this! Why doesn't Unix allow you to do this?
1) cd a; pwd
   Should this report that you are in a or c?
2) cat a/d/a/d/a/d/a/d/b
   Ridiculous! But it would work.
3) find . -name b
   Found it in a/b, a/d/a/b, a/d/a/d/a/b, ... That could go on forever.
The great Unix solution: no hard links to directories
Symbolic Links
General Introduction
What are symbolic links (symlinks)?
- Kind of like regular files, except the content is the name of another file
- The OS substitutes the path of the other file for the name of the symlink during name resolution
- Can be created at the shell level with ln -s SRC DST
- Symlinks do not have to point to files or directories that actually exist!
  - i.e. dangling symlinks are OK; the file system is still consistent
- Hard links to symbolic links are OK -- you are referencing the symlink itself, not the file pointed to by the symlink
Take a look at the important directory entries for the above situation after the symbolic link is created (say we want to access sh, which is located in /usr/bin). The bin symlink stores the path /usr/bin. If the user types /bin/sh from the root directory (designated / in Unix), the OS does textual substitution, resulting in the path /usr/bin/sh. Note: symbolic links can be chained, resulting in multiple textual substitutions during name resolution.
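A small C sketch of the stored path, using readlink (which reads the symlink itself rather than following it); it assumes /bin really is a symlink to /usr/bin as in the example above:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[4096];
    ssize_t n = readlink("/bin", buf, sizeof buf - 1);
    if (n >= 0) {
        buf[n] = '\0';                /* readlink does not NUL-terminate */
        printf("/bin -> %s\n", buf);  /* the text spliced in during resolution */
    }
    return 0;
}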
Potential problems with symbolic links?
- Shouldn't they suffer from the same cycle problem as hard links?
  - Yes, so find and other programs have been written to not follow symlinks
- What about infinite-loop symlinks, e.g. ln -s a b; ln -s b a?
  - Unix only allows 20 iterations of symbolic link name resolution, so this "infinite loop" would only be iterated 20 times before returning an error (see the sketch after this list)
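A sketch of the infinite-loop case, assuming a and b do not already exist in the current directory:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    symlink("b", "a");    /* a -> b */
    symlink("a", "b");    /* b -> a */
    if (open("a", O_RDONLY) < 0 && errno == ELOOP)
        printf("open: %s\n", strerror(errno));  /* too many levels of symbolic links */
    unlink("a");
    unlink("b");
    return 0;
}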
Connection back to the GNU tar problem
What was the GNU tar problem? New syscalls/ideas that attack this problem:
- You can open a directory for read access (not write!), allowing you to "put your finger" on the file system, e.g. open("/usr", O_RDONLY, ...)
- openat(fd, filename, O_RDONLY, ...)
  - Uses a directory file descriptor, so even if someone plays the "symlink game," you will still be talking to the same directory (see the sketch after this list)
- unlinkat(fd, filename)
  - Again, fd ensures you are talking to the same directory
- stat resolves symbolic links and returns information about the file the link references; lstat does not resolve symbolic links and returns information about the link itself
- open(..., O_DIRECTORY)
  - Only open the file if it is a directory, i.e. the open will only succeed on a directory
- open(..., O_NOFOLLOW)
  - The open will fail if the file is a symbolic link
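A sketch tying these together; the path /usr and the name bin/sh are just illustrative:

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    /* "Put your finger" on the directory: only succeeds if /usr is a directory. */
    int dirfd = open("/usr", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return 1;

    /* Resolved relative to dirfd, not to whatever "/usr" names by now;
       O_NOFOLLOW makes the open fail if the final component is a symlink. */
    int fd = openat(dirfd, "bin/sh", O_RDONLY | O_NOFOLLOW);
    if (fd >= 0)
        close(fd);
    close(dirfd);
    return 0;
}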
File Name Resolution
- Look at the first byte. If it is /, start at the root directory; else start in the CWD (current working directory)
- Walk through the string left -> right and look for filename components (see the sketch after this list)
- For each filename component, resolve that component by looking for directory entries that match the name of the component
- If one of those entries is a symlink, splice in the contents of the symlink and recurse
  - Resolve the symlink relative to the symlink's directory
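A user-level sketch of this walk; the real lookup happens in the kernel against directory entries, so that step appears here only as a comment:

#include <stdio.h>
#include <string.h>

int main(void) {
    char name[] = "/usr/bin/sh";     /* path to resolve */
    char *p = name;

    if (*p == '/') {                 /* the first byte decides the start */
        printf("start at the root directory\n");
        p++;
    } else {
        printf("start at the CWD\n");
    }

    /* Walk left -> right, one filename component at a time. */
    for (char *comp = strtok(p, "/"); comp; comp = strtok(NULL, "/")) {
        printf("resolve component \"%s\"\n", comp);
        /* Kernel step: find a directory entry matching comp; if the
           entry is a symlink, splice in its contents and recurse,
           resolving relative to the symlink's directory. */
    }
    return 0;
}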
Random notes on chdir and chroot
- Each process has a CWD, stored as an inode number
  - This can be changed with chdir
- The chroot function will treat the specified directory as the root directory
  - "chrooted jail": you can never get back to your old root directory, so make sure your new one is a hospitable place (see the sketch after this list)
  - chroot can only be done by root
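A minimal jail-setup sketch; the directory /jail is hypothetical, and this must run as root:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    if (chroot("/jail") != 0) {   /* fails with EPERM unless run as root */
        perror("chroot");
        return 1;
    }
    chdir("/");                   /* move the CWD inside the new root too */
    /* From here on, "/" means /jail: the old root is unreachable. */
    return 0;
}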
File System Robustness
Goals:
- Good performance: high throughput/low latency
- Durability: if there is a failure in the general underlying system, e.g. hardware, the data survives
- Atomicity: data remains consistent if there is a failure; actions are either done or not done, e.g. write("abc", ...) will never give just "ab"
Robustness pertaining to SAVE in a text editor
SAVE writes the contents of RAM to disk, in general taking several writes. What if we assume a power failure can occur at any time?
- If a power failure occurs while SAVE is overwriting your data, it will result in an inconsistent state
- OK, then write to a copy and don't overwrite the original
Situation: You write to a copy and the power goes out. Which version is correct? The copy or the original?
- If the power went out while saving the copy, the original is correct
- If the power went out while overwriting the original, the copy is correct
- But how do you know which is correct?!
Commit Records
What are commit records?
- Keep a record of all actions performed, e.g. successfully writing to the copy or successfully overwriting the original with the copy
- Assume that writing to the commit record is atomic
If we assume no atomicity in writing to the commit record
We want to write Y to the commit record to indicate that the contents of Y are correct instead of X. Consider a 1-block commit record over time if we assume the write is not atomic: at some instant during the write, the block contains neither X nor Y, so a crash at that moment leaves garbage.
What if we add another block to the commit record? (1) We write Y to commit record 1 while commit record 2 still indicates X; (2) then we write Y to commit record 2. At every point we have either X or Y, but if the two records disagree after a crash, which one do you trust?
Ok, we can actually solve this with 3 records. At any given time, if (record 1) == (record 2), then use that value; else use the value in (record 3).
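A sketch of that voting rule, with records modeled as plain characters ('X', 'Y', and '?' for a torn write):

#include <stdio.h>

/* If the first two records agree, that value is safe; otherwise a
   crash hit while one of them was being written, and record 3 was
   not being written at that moment, so it is intact. */
static char choose(char r1, char r2, char r3) {
    return (r1 == r2) ? r1 : r3;
}

int main(void) {
    /* Crash during the write of record 2: r1 says Y, r2 is torn,
       r3 still says X.  The vote falls back to X, so the commit
       of Y simply never happened. */
    printf("%c\n", choose('Y', '?', 'X'));
    return 0;
}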
So, do we really need triply redundant commit records?
Lampson-Sturgis Assumptions
- Writes can fail
- Storage (blocks) can spontaneously decay
- The read will tell you if a block is bad**
- Errors are rare
- ... there are more but not directly pertinent ...
** If this is true, then we only need doubly redundant commit records, because we can tell when one of the blocks is bad and just use the other.
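A sketch of the doubly redundant version under that assumption; read_record is a hypothetical stand-in for a disk read that reports bad blocks:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in: a real disk read would report a bad block
   through its error result.  Here block 1 has "decayed". */
static bool read_record(int which, char *out) {
    if (which == 1)
        return false;
    *out = 'X';
    return true;
}

int main(void) {
    char v;
    /* If record 1 reads cleanly, trust it; otherwise the
       assumption guarantees record 2 is the good copy. */
    if (!read_record(1, &v))
        read_record(2, &v);
    printf("%c\n", v);
    return 0;
}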
Adventures with rename
rename("a", "b")
- Read directory block
- Modify "a" -> "b"
- Write directory block
rename("d1/a", "d2/c")
- Read d1's blocks into RAM
- Read d2's blocks into RAM
- Modify main memory copy
- Write d1's data
- Write d2's data
- If the system crashes between 4 and 5, the file is gone: disaster
- If we switch steps 4 and 5 and crash in the same spot, you'll see 2 hard links to the same file with link count 1
- Unlinking 1 of these files will cause the link count to go to 0 --> file is deleted + dangling pointer: disaster!
The fix (see the sketch after this list):
- Before step 1: read a's inode into RAM
- Before writing d2's data: write a's inode out with link count ++
- Then write d2's data, then d1's data
- Then write a's inode out with link count --
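A sketch of that ordering, with the disk writes modeled as printouts; the comments note what a crash after each step would leave behind:

#include <stdio.h>

/* Stand-in for one block write to disk. */
static void disk_write(const char *what) { printf("write %s\n", what); }

int main(void) {
    disk_write("a's inode, link count = 2");
    /* Crash here: count too high for one entry -- a storage leak, not a disaster */
    disk_write("d2's data (entry c added)");
    /* Crash here: two entries, and the count of 2 is exactly right */
    disk_write("d1's data (entry a removed)");
    /* Crash here: one entry, count too high -- again just a leak, fixable by fsck */
    disk_write("a's inode, link count = 1");
    return 0;
}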
Correctness Invariants
1) Every block should be used for exactly 1 purpose: data, indirect, bitmap, inode, superblock, boot block
2) If a block is reachable, it should contain initialized data
3) All reachable data blocks are marked as used in the bitmap
4) All unreachable blocks are marked as free in the bitmap

Consequences of violating each invariant, in order:
1) You lose data: DISASTER!
2) You allocate data twice and lose data: DISASTER!
3) You lose data: DISASTER!
4) Actually not so bad: a disk space leak, but the program fsck will try to fix this