CS 111 Lecture 14

COMPUTER SCIENCE 111 - Winter 2012

Lecture 14: File System Robustness

Scribes: Alexander Lee and Ian Yarbrough

Lecture Date: 02/29/12

Real Time Unix Files

These next few concepts are overflowed from the previous lecture on File System Implementation

Contiguous Files

Issue with file system organization

Interactions between inodes have to go through multiple levels of indirection before getting to the actual data block.
If we call an lseek(), we may actually be calling multiple lseeks each time we read the indirect indix from disk.
This leads to unpredictable performance, a lot slower than originally percieved

To prevent this, many real time unix systems have a different kind of file called "contiguous files". Within each inode there is a bit that indicates whether the current inode
is contiguous or not. In contiguous inodes, instead of having direct and indirect blocks, there is a pointer to a contiguous region of data on disk. This is similar to the RT11
file system, but this file system has the capability to support both contiguous and ordinary UNIX files.Since it is simply just a big chunk of disk, there would be no need for
using multiple lseeks()

To open a contiguous file, you simply specify that is is contiguous by specifying the flag. In the example below we specify the size of the file to be 1 GB.

open("file", O_CREATE | O_RDWER | O_CONTIG, 0644, 1024 * 1024 * 1024);

What's the downside?

fragmentation
need to specify size of file during creation

/dev/null

What is /dev/null?

Two functionalities

Any reads to /dev/null yields EOF
Any writes to /dev/null become discarded

Then how is /dev/null created?

mknod("/dev/null", magic numbers);

This system call, mknod, would be found in a boot script. The system call mknod() is able to create a "special" file with the
correct parameters. So now when using a ls-l /dev/null it will return:
crw-rw-rw- ... /dev/null
The inode for /dev/null has a bit that indicates whether it is a special file, and it has specific numbers to dictate which drivers
to specifically use for the file. The numbers are the above magic numbers specified upon file creation.



One issue that arises is that when creating special files with mknod() you can specify any magic numbers you want. Which can in turn
point to any device you want. This could lead to disaster! The simple solution is that only root can run the mknod() system call.

Named Pipes

A named pipe is similar to a regular pipe, except that it the scope of the pipe is longer than one process, also needs to be deleted. How to create a named pipe:

 $ mkfifo /tmp/mypipe
 $ ls -l /tmp/mypipe
prw-rw-rw- ... /tmp/mypipe

The p indicates that the file is a named pipe, examples on how to use the pipe:

 $ cat /tmp/mypipe > foo &   //Here the cat is reading from the pipe
 $ sed s/a/b/ /etc/passwd > /tmp/mypipe    //Here sed is writing to pipe, cat will then read

When the above commands finish, foo will contain a copy of the password with b substituted for a.

The named pipe has an inode because it is looked up by a directory entry just like a regular file, but there is no data stored on disk. The pipe is in RAM.
We have essentially "hijacked" the filesystem code to be used for a different purpose

Problem: Assume our machine has two file systems

Suppose we have two file systems on our machine.

/ (root file system)
/usr (read only)

/usr is a directory within the root directory which points to a completely different file system. One way to not confuse the two systems would be giving each file system a specified number
In UNIX this number is of type dev_t. In the inord, there is both an inode number (ino_t) and also a device number (dev_t) which tells which file system the inode is associated with.

Almost no real file systems use this method! Why? Suppose:

$ln /usr/bin/sh /bin/sh

There are two issues:

Since /usr was a read only file system, we can't modify the link count. To get this to work, we can't allow read only file systems
If you unmount /usr, then this link stops working. All file systems must be working in order for any of them to be working

A better way to get different file systems would be to have a mount table in the kernel

 dev_t  ino_t  dev_t
     1    2090    29

This maps the inode 2090, containing device 1, to the root directory of the file system with device number 29; this helps keep track of active mounted file system. This solves the link count problem,
since we don't need to be able the change the read only file systems link count, we just adjust the mount table located in main memory.

File System Robustness

Goals

Durability: Want the file system to survive, even if the hardware fails. (e.g. power turns off)
Atomicity: When we make a change to a file, that change should either happen, or not happen. No in between state.
High Throughput
Low Latency

These are competing goals, for example:

A good way to get higher performance is to dally. By doing so it gains performance by loses durability, it may tell the user a file has been written when it hasn't.

Text Editor SAVE operation

We have 10 blocks of data on disk. We have 10 blocks of new data sitting on RAM. To do a save, we could just write the 10 blocks in RAM onto disk one at a time.

This operation may or may not be atomic. If only one block was changed, and block writes are atomic then the save operation is atomic, otherwise it is not.

One way to ensure atomicity is by writing the memory buffer to a different temporary region in desk.

| old data | metadata

| new data | metadata

The new data may be partially written if the computer crashed in the middle of the operation

Golden Rule of Atomicity never overwrite your only copy on disk!

We could imagine a file system in which half of the disk is reserved for storing temporary results.

Atomic Higher-level Writing of Blocks

Problem:

 	How do we tell which data on disk to use? We need a "switch" function that tells which data to use. There will be meta information associated
	with each file to tell which of the two data to use. But how do we update the meta data atomically?

Now we must make assumptions for the underlying support for atomic update of metadata, we will make each assumption as simple as possible

Assumption:

 old state ----> ???? ----> new state

t -------------------->

1	old	?	new	new	new	new	new
2	old	old	old	?	new	new	new
3	old	old	old	old	old	?	new

time ---->

Each time we change a block from old to new it is a low-level write

After a crash, scan all 3 blocks and choose the majority file to determine a solid state that isn't metadata. If they all disagree, as in the fourth case, use the first block.

Not many file systems actually use this, because of performance factors (slow). The assumptions we made earlier are very conservative in nature, normally we use the assumptions below.

Lampson-Sturgis Assumptions

Storage writes may fail. A failed write to one block might corrupt another block. Reads can detect bad blocks produced by failed writes
Storage blocks can decay spontaneously (i.e. crazy burst of solar radiation). Again, a later read can detect this
Errors are rare

write(fd, buf, 8192); //This write also has to be aligned, must start at beginning of block

Atomically Renaming Files

rename("a", "b");

When renaming a file you need to write to the disk where the dentry is contained. You also need to write to the disk where the inode is currently contained to update the time stamp. Since we are
dealing with two separate blocks the operation is not atomic

Now what would happen is we tried this:

rename("d/a", "e/b");

One way to do this is to put a new directory "b" in "e", then delete "d/a". This way you won't lose the file if it crashes in the middle, but there is still a problem.
You could end up with both "d/a" and "e/b", but only a link count of 1. If you remove one of the two directory entries, the link count drops to 0, the storage is reclaimed,
and the other directory entry is now invalid.

	d's inode		data block		e's inode		data block		file inode
         ____________            _______________	  ____________		 _______________	 ____________
Before	|__|2010|____|		|__| a | 23 |___|	 |__|2009|____|		|_______________|	|___|1|______|
	 ____________		 _______________ 	  ____________		 _______________	 ____________	
After	|__|2012|____|		|_______________|	 |__|2012|____|		|__| b | 23 |___|	|___|1|______|

In this figure it shows the before and after states of the inodes and datablocks when a rename occurs. D and E's inode
both have the timestamp of the a and b, and these inodes point to the respective datablock that has the information of
a and b. Within these datablocks lie the actual filenames and the address of the file it points to. The actual file
inode shows the link count of the file, in between the states the link count can drop to 0, which is the problem described
earlier.

One approach would be:

Increment the link count (write to inode)
Write the destination directory block ("e/b")
Write the source directory block (delete "d/a")
Decrement link count (write to inode)

One issue with the above approach is that if we crash inbetween 1 and 2, or 3 and 4, the link count would be too high, which then leads to a possible storage leak.
This is still the better than the first option. Systems routinely run clean up processes, fsck, which finds leaks and reclaims memory, resolving the link count issue.
It is the better option than simply crashing

File System Correctness Invariants

Every block is used for exactly 1 purpose
All referenced blocks are initialized to data that is appropriate for its type
All referenced blocks are marked as "used" in the bitmap
All unreferenced blocks are marked "free" in the bitmap

1	old	?	new	new	new	new	new
2	old	old	old	?	new	new	new
3	old	old	old	old	old	?	new

1	old	?	new	new	new	new	new
2	old	old	old	?	new	new	new
3	old	old	old	old	old	?	new

1	old	?	new	new	new	new	new
2	old	old	old	?	new	new	new
3	old	old	old	old	old	?	new