CS 111: Lecture 12 - File system implementation

Scribe notes for 05/15/2012

By Daniel Daskalov

Levels of abstraction in a typical UNIX-ish file system

Sectors (ex: 512 bytes) - assumed to be a linear array with the "locality of reference" property
Blocks (ex: 8192 bytes) - same assumption as blocks
Partitions - a block of disk that holds a file system (all meta-data and data). A single disk can hold multiple file systems (This can even be taken to an extreme - we can create a big file in our file system and treat it as a partition! Then we can put a file system in it!)
inodes - in UNIX a "file" is equivalent to "inode + associated data"
File name components (ex: /usr/bin/sh) - each directory entry is a component of the file name
File names - the combination of all file name components gives the actual file name
Symbolic links - files whose contents is a file name

Let's say that we have the following shell command:

$ cat /usr/bin/sh | grep x

To implement the function "cat" we will need to use a syscall like:

open("/usr/bin/sh", O_RDONLY);

But we need a way to resolve the components of the file name. We can use an algorithm that would go through the file system and resolve the file name.

File name resolving

We first look at the root directory with inode number 1. Somewhere in the root directory there is a directory entry with a filename "usr" and inode number 9761 (for example).

Following the link to inode 9761 we find the directory "usr" and one of its directory entries is "bin" with inode number 296.

Following that link finally leads to the file "sh" that needs to be accessed with inode number 296.

Aside: Traditionally this process is sequential, but now some file systems with large directory structures have index directories using a hash table

Thus we are left with the following name resolution procedure:

D = current directory (per process)
For each file name component C
D = look up C in D's data
If D is a symbolic link to file S, modify list of filename components to be the actual path
Return D

We now have a new syscall: chdir("src/ms") (for the example file "src/ms") - this syscall uses the above algorithm and sets the process's current working directory to D

But if the file name starts with a "/": D = root directory.
Now we have another new syscall: chroot("/tmp/funny") - this system call is like "chdir" but changes a process's root directory.

NOTE: Once a process invokes "chroot" it is no longer able to return to a previous directory. (the previous directory of root is root) So this syscall should be used with caution. A term for this is "chroot jail" since the process is "jailed" in its own small directory space

In addition, chroot is also a priviledged system call and here is why:

Consider:

$ ls -l /bin/su
-r-sr-xr-x (other information about su)

This is trusted code and the "s" in the permissions indicates that the process runs as root
Now assume chroot isn't a protected syscall. A malicious user can do the following:

$ mkdir /tmp/jail/bin /tmp/jail/etc
$ ln /bin/su /tmp/jail/bin
$ create /tmp/jail/etc/passwd /tmp/jail/etc/shadow

Now the user can place any password in the passwd file and when he invokes chroot, the user places himself in a "chroot jail", but if he attempts to use a priviledged instruction and is asked for a password, su will look in the link that the user created where he placed a password the he knows. su will see that the passwords match and will give root priviledges to the user. Once the user obtains root priviledges, there are ways to escape the "chroot jail".

There is one more weakness in our name resolving procedure. If we create a symbolic link, for example, /src -> /usr/src, then the algorithm will try to replace the symbolic link with the actual path, but then it will encouter /src again and would try to do the same. Thus we can get a loop.
Similarly we can also get loops in the following situations:

$ ln -s bin bin

OR

$ ln -s a b
$ ln -s b a

To fix the problem, we add a counter to the algorithm that counts symbolic links. If the counter reaches, say, 20, then give up, fail, and set errno to -ELOOP.

When it comes to symbolic links, we can have some interesting results. Conisder the following situation:

$ ln -s a b
$ mkdir d
$ ln b d/b
$ echo foo > d/a
$cat d/b
foo

Then:

$ls -l d/a d/b b
(other info) b -> a
(other info) d/b -> a
(other info) d/a

These examples show that we can have hardlinks to symbolic links and also that symbolic links are resolved relative to their directories

Now consider this example:

$ ln -s /no/such/file nowhere
$ls -l nowhere
(other info) nowhere -> /no/such/file
$ cat nowhere
nowhere: cannot open: no such file or directory

Thus we can also have dangling symbolic links that lead nowhere

So which syscalls are symlink aware?
"open(...)" follows symlinks, "unlink("nowhere")" doesn't follow symlinks

Filling in some gaps about links in file system implementatons

How to implement them using syscalls?

$ ln -s a b			symlink("a", "b"); - creates a symlink and installs it in parent directory

$ ls -l b			lstat("b", &st); - doesn't follow symlinks

				stat("b", &st); - follows symbolic links

				readlink("b", buf, size); - find the name the link points to

$ ln a b			link("a", "b"); - creates a hard link

To find the link count, we can use:

$ ls -ln b
2961 ....... 2 a
2961 ....... 2 b

Where the -n option of ls prints link counts which are the numbers right before the file name.

Links allow us to have the following scenario where boxes are directories and circles are files:

links examples

unlink("a/b/c"); would only remove the link from directory "b". To remove a file for real, we must remove all directory entries for the file
The file system mainains a link count to keep track of the number of directory entries for a file (if the link count becomes 0, can we reclaim the storage? Only if the file is not currently open.)
Application should be able to handle this. Consider:

struct st;
if (fstat(0, &st) == 0) 		// 0 means stdin
	if (st.st_nlink == 0)
		print("ouch");

How can we cause a file to be open, yet have a link count of 0? Here is how:

$ (rm a; cat) < a

How about the following situation:

Directory link examples

To fix this issue, we just don't allow hard links to directories

Other types of files

Device controller

$ ls -l /dev/null
crw-rw-rw- 1 root root 1 3 May 15 11:29 /dev/null

The "c" in the ls output stands for "character special file". The numbers "1 3" are the major and minor device numbers: the major chooses a driver and the minor selects things from the driver.
To make such files:

$ mknod /dev/null c 1 3

This way we can create any device, but notice:

$ mknod rootdisk b 27 19 (random numbers) - this device file can talk to the root disk! So for this reason mknod is also priviledged

Pipe file

$ mkfifo /tmp/fifo - this creates a named pipe
$ ls -l /tmp/fifo
prw-rw-rw ....... /tmp/fifo - the p stands for pipe
$ cat /tmp/fifo - cat hangs waiting for input from the pipe
$ echo x > /tmp/fifo - then we type this in another terminal and cat in the first terminal prints "x"

Named pipes are also similar to socket files for network communication

So in summary we have these types of files in a UNIX-ish file system

directory
regular file
symlinks
special devices
fifos/sockets
...
shared memory objects

Sample Research File System

Let's look at a file system that is currently in development
It's designed to handle 120 PB = 120 000 TB = 120 000 000 GB
It is implemented on 200 000 hard drives (each is ~600 GB)

It's called GPFS - General Parallel File System (it's designed for speed)

Some ideas:

striping - blocks of data are distributed over multiple disks - can read from many disks in parallel
efficient directory indexing (ex: hash tables)
a distributed file system (meta data) - meta data is not on only one machine but distributed to all machines. We gain both performance and reliability
distributed locking - so we can deal with stale meta data (files are notified if their meta data is stale i.e. old)
partition awareness (network partition) - largest partition stays alive. If part of the file system loses connection, it can keep going. Re-sync later after connection is re-established