Recall the inode and indirect block structure:
What kind of fragmentation and wasted space can we get with this structure?
If the file size is 0 bytes, we have no fragmentation. Technically, we have 48 bytes of inode space wasted: 40 bytes for the ten direct block numbers (4 bytes each), plus 4 bytes for the indirect block number, plus 4 bytes for the doubly indirect block number, all of which are unused. If the file size is 1 byte, assuming a block size of 8 KiB (8192 bytes), we have 8191 bytes of wasted space in the block allocated for the file, plus 44 wasted bytes in the inode (one direct block number is now in use).
What if we create an empty file, seek very far into it, and write at that location? Consider the following code:
int fd = open("foo", O_CREAT|O_WRONLY|O_TRUNC, 0666);
if (lseek(fd, 1000000000000, SEEK_SET) < 0)
    error();
write(fd, "hello", 5);
This requires the creation of a doubly indirect block.
This is called a "holey file" (a file with holes). Note that the file size is 1000000000005 bytes, although clearly it does not take up that much disk space. In this case, the wasted space is 44 bytes in the inode, 8188 bytes in the doubly indirect block, 8188 bytes in the indirect block, and 8187 bytes in the data block itself (each 8 KiB indirect block holds 2048 four-byte block numbers, only one of which is in use).
Even though the file "appears" to be a trillion bytes in size, if we run the following command:
$ ls -ls foo
We will see that foo doesn't actually use up very much disk space (the -s option asks ls to report how much disk space each file actually uses, in blocks).
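Hypothetically, the output might look something like this (a made-up listing: the 24 blocks of 1 KiB each account for the single data block plus the indirect and doubly indirect blocks, while the size field still reports a trillion-plus bytes):

24 -rw-r--r-- 1 eggert eggert 1000000000005 Jan 20 2015 foo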
What if we try to apply this type of system design to a relational database, for instance?
Note that the database file, which lives a level of abstraction above the underlying file system, appears to contain many zero bytes. However, these are not actually stored on disk (similar to the lseek example above).
What problem can arise due to this? The underlying system can exhaust disk space before the database does. The database sees these zeros as space it already owns and may attempt to overwrite them with real data; but writing into a hole forces the file system to allocate a real block, and that allocation can fail because the disk is actually full, since the zeros were never stored on disk in the first place.
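A minimal C sketch of this failure mode (the file name and offset are made up for illustration):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical sparse database file with a hole at this offset. */
    int fd = open("dbfile", O_WRONLY);
    char block[8192] = {0};
    /* "Overwriting" zeros that were never allocated forces the file system
       to allocate a fresh block, which can fail if the disk is full. */
    if (fd >= 0 && pwrite(fd, block, sizeof block, 1000000000) < 0
        && errno == ENOSPC)
        fprintf(stderr, "disk full, even though the file had 'space'\n");
    return 0;
}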
Another downside to this design is that it is somewhat slower because of non-sequential access: to reach the raw data, the CPU and disk must first traverse the levels of indirection in the data structure.
The following are layers of abstraction in a file system design:
Here are some examples of partitions:
These separate file systems are used to limit the damage caused by filling up or corrupting any one of them. For example, if we completely fill up /usr or /home, that does not mean that /boot is also out of space; if it were, kernel problems would result.
The mount table is the record of active file systems. Here's what it might look like, where the column "file system" denotes which file system the device, inode pair should map to:
dev  ino   file system
---  ----  -----------
1    97    ...
1    1096  ...
3    27    ...
2    35    ...
...  ...   ...
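In C, a mount-table entry might be sketched like this (hypothetical types, not any particular kernel's real structures):

struct file_system;          /* opaque descriptor for a mounted file system */

struct mount_entry {
    int dev;                 /* device holding the mount point directory */
    int ino;                 /* inode number of the mount point directory */
    struct file_system *fs;  /* the file system mounted at that point */
};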
Side note: consider a path "C:\etc\passwd" in Windows. The "C:" part denotes the file system, and the "\etc\passwd" part is the file name. In Unix, however, the mount table takes care of which file system each device, inode pair belongs to. Therefore, the "C:" part is abstracted away from the user (which is why we don't see paths like that).
What is a potential problem with this design? We might have loops in the mount table, creating an infinitely long tree. For example, if our mount table has the following information:
/etc --> file sys 3
/etc/foo --> file sys 4
/etc/foo/bar --> file sys 1
How do we fix this? Easy: the mount system call will only succeed if the target file system is not already mounted. For example, mount("/etc/foo", "/dev/rst/03", ...) fails if /dev/rst/03 is already mounted. An exception to this is the concept of loopback mounts, which allow already-mounted file systems to be mounted again; with loopback mounts, you must be careful about loops.
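As a sketch, with hypothetical helper functions rather than a real kernel API, the rule amounts to:

#include <errno.h>

int already_mounted(const char *device);        /* hypothetical mount-table lookup */
int add_mount_entry(const char *mount_point, const char *device);

int do_mount(const char *mount_point, const char *device)
{
    if (already_mounted(device))
        return -EBUSY;   /* refusing duplicate mounts is what prevents loops */
    return add_mount_entry(mount_point, device);
}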
Note that with this implementation, we can easily support multiple types of file systems at once. For example, /usr could be of type ext4, /home could be of type vfat, and /home/eggert could be of type nfs.
How are all of these different file system implementations supported? By using object-oriented (OO) programming. There is a common interface (API) for each file system implementation; in OO terms, the file system itself is like an object, and the Virtual File System (VFS) layer provides an OO implementation in C. You essentially implement a file system by subclassing the VFS layer, which provides standard operations, such as read/write (basic I/O), unlink, rename (metadata operations), etc.
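For instance, a minimal sketch in C of what such an interface might look like (the names here are hypothetical, not Linux's actual VFS types):

#include <sys/types.h>

struct file_system;  /* opaque per-file-system state ("the object") */

/* Hypothetical table of "virtual methods": each concrete file system
   (ext4, vfat, nfs, ...) supplies its own implementations. */
struct vfs_ops {
    ssize_t (*read)(struct file_system *fs, ino_t ino,
                    void *buf, size_t len, off_t off);
    ssize_t (*write)(struct file_system *fs, ino_t ino,
                     const void *buf, size_t len, off_t off);
    int (*unlink)(struct file_system *fs, const char *name);
    int (*rename)(struct file_system *fs, const char *from, const char *to);
};

struct file_system {
    const struct vfs_ops *ops;  /* the "subclass" chooses its method table */
    void *private_data;         /* superblock, caches, etc. */
};

/* Generic code calls through the table, never a concrete type. */
ssize_t vfs_read(struct file_system *fs, ino_t ino,
                 void *buf, size_t len, off_t off)
{
    return fs->ops->read(fs, ino, buf, len, off);
}

Calling through the function-pointer table is exactly the dynamic dispatch an OO language would provide, which is why the VFS layer is described as OO programming in C.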
As a historical sidenote: VFS object orientation was originally invented for NFS (Network File System) in 1985!
How are file names resolved and paths followed? For example, what happens when a user runs open("abc/def/ghi", ...)? Internally, namei is used to map strings to device, inode pairs. It does a left-to-right traversal over the file name, exploring directories and keeping track of the current directory's inode. For example, after exploring abc/def, the current directory inode is def's inode, and the file name component left to resolve is ghi; namei then resolves that component to a new device, inode pair.
If a file is missing during this traversal process (for example, if def is nowhere to be found once abc is the current directory), then errno is set to ENOENT. If search permissions (the x bit) are missing on a directory to be explored, errno is set to EACCES.
The file name resolution process starts its search in the current working directory. This is a per-process piece of information. If the file name starts with "/", however, we start the search at the root directory. The root directory is also a per-process piece of information.
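Here is a highly simplified sketch of this traversal in C (the inode representation and helper functions are hypothetical, and symbolic links are ignored for now):

#include <errno.h>
#include <string.h>

struct inum { int dev, ino; };  /* a device, inode pair */

/* Hypothetical helpers standing in for real file system internals. */
int dir_lookup(struct inum dir, const char *name, size_t len, struct inum *out);
int has_search_permission(struct inum dir);

/* Simplified namei: resolve `path` starting from `start` (the current
   working directory, or the root if the path began with '/'). */
int namei(struct inum start, const char *path, struct inum *result)
{
    struct inum cur = start;
    while (*path) {
        while (*path == '/')
            path++;                        /* skip separators */
        if (!*path)
            break;
        size_t len = strcspn(path, "/");   /* length of the next component */
        if (!has_search_permission(cur))
            return -EACCES;                /* missing x bit on a directory */
        if (dir_lookup(cur, path, len, &cur) < 0)
            return -ENOENT;                /* component not found */
        path += len;
    }
    *result = cur;
    return 0;
}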
How does file name expansion work on symbolic links? Say we run ls and get the following:
$ ls -l a/b
lrwxrwxrwx 1 eggert eggert 7 Jan 23 2015 b -> foo/bar
Then the file name a/b/c will automatically be resolved to a/foo/bar/c; the link target foo/bar is interpreted relative to a, the directory containing b, and the traversal cursor is switched to the expanded name.
For example, if our file system tree looks like the following, we are still allowed to run $ ln -s /bin/sh /usr/bin/sh (a symbolic link may point to a file that does not exist yet, or ever):
Are there any potential problems with any of these rules about symbolic links? Because symbolic links can link to directories, we can get a loop, leading to an infinitely long file name. For example, if we run $ ln -s ./ ./a/b/c, then the path a/b/c/a/b/c/a/b/c/a/b/c/... is valid and might be traversed forever during file name resolution! To fix this problem, the system places a hard cap on the number of symbolic link expansions allowed before namei fails, returning ELOOP (for example, at most 20 soft links can be traversed to get from the source to the target).
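The cap itself is trivial to sketch; resolution keeps a counter that persists across the whole lookup (the cap's value and the helper are hypothetical):

#include <errno.h>

enum { MAXSYMLINKS = 20 };   /* hypothetical cap on expansions */

/* Called each time resolution encounters a symlink. */
int check_symlink_budget(int *expansions)
{
    if (++*expansions > MAXSYMLINKS)
        return -ELOOP;       /* too many expansions: assume a loop, give up */
    return 0;
}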
Do symbolic links have their own inodes (and thus inode numbers) in the file system? This depends on the implementation.
If YES: there are two plausible implementations. In one, the symlink's inode points to data blocks holding the symlink's contents (the target string). In the other, the target string is stored directly in the inode itself. Either way, whenever the symlink is encountered, a file name expansion takes place, replacing the symlink's file name with the file name reached through its inode.
In this case, note that we can create hard links to symbolic links; two separate directory entries would then contain the same inode number, which is the symbolic link's inode! The following example illustrates this (the 2 is the link count of the symlink's inode):
$ ln -s a b
$ ln b c
$ ls -l
2 ... b -> a
2 ... c -> a
$ mv c /some/other/dir
Note that after the mv, c still points to the name a, which is now interpreted relative to c's new directory, so it may refer to a different file than b does.
If NO: the symlink information must be entirely contained within the directory entry for the symlink. While an ordinary file might have the file name and the inode number in the directory entry, perhaps the symlink would instead have the link name and the string corresponding to the linked (target) file.
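A sketch of what those directory entries might look like in C (hypothetical layouts with arbitrary fixed sizes):

struct dirent_regular {
    char name[256];      /* file name within this directory */
    int  ino;            /* inode number of the underlying file */
};

struct dirent_symlink {
    char name[256];      /* link name within this directory */
    char target[256];    /* target string stored directly in the entry */
};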
As a side note/tangent, while on the topic of file system design: some systems have an optimization that stores very small files in the inode itself, without requiring any additional data blocks. For example, $ ls -ls might show a file length of 7 bytes but 0 blocks of disk space usage, since all 7 bytes of the file's data are contained in the inode.
As usual, there are potential pitfalls and technicalities...
What do permissions on a symbolic link mean? If we run ls, we might see lrwxrwxrwx. The l indicates that the file is a symlink; but what does it mean for the symlink to have specific permissions? Actually, these permissions are junk; they don't mean anything at all. No matter what the permissions say, a symlink can be read but not written to.
What about the owner and group of a symlink? These are mostly junk as well. Anybody can read a symlink, and as noted above, nobody can write to one. However, the owner and group are used for quotas (e.g., the symlink takes up space against the disk quota of whoever owns it).
How does ls know a file is a symbolic link, and how does it know the corresponding character expansion that should take place? The following system calls help out:

lstat, which is just like stat, except that lstat does NOT follow symlinks, while stat does. Therefore, lstat can give us information about the symlink itself, instead of information about the file that it points to.

readlink, which reads the value of a symbolic link (that is, the character expansion). For example: readlink("b", buf, bufsize), where b is a symbolic link.

unlink, which deletes a "name". Similarly to lstat, this does NOT follow symlinks, allowing us to delete a symlink without deleting the underlying file as well.
If we run $ ls -l /dev/null, we will see something like crw-rw-rw- 1, 3 /dev/null. Note that where the size of the file would normally appear, we instead see 1, 3. These numbers are the driver (major) number, 1, and the device (minor) number, 3. Notice that the file type is c, a character special file. b (block special files) and p (named pipes) are some other special file types. Other notable special files include /dev/zero and /dev/full.
Say we accidentally delete /dev/null. This is very bad: programs that write to that location may now instead be writing to an ordinary file on disk, which will hog disk space with junk output that should have been discarded. As an example of special file creation, we can re-create /dev/null like so:
$ mknod /dev/null c 1 3
Also, note that since /dev/null has its own inode, we can link to it:
$ ln /dev/null haha
Another special file is a named pipe (FIFO) file, which behaves like a pipe but has a name that lives in the file system instead of existing only in memory:
$ mkfifo foo
$ ls -l foo
prw-rw-rw- ______ foo
$ cat foo
The command $ cat foo will hang until some other terminal does something like the following:
$ echo blah > foo
At this point, the original terminal will stop hanging, and cat will output blah to the console.
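A minimal C sketch of the reading side, equivalent in spirit to cat foo above:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    mkfifo("foo", 0666);             /* create the named pipe; fails harmlessly
                                        with EEXIST if it already exists */
    int fd = open("foo", O_RDONLY);  /* blocks until some writer opens foo */
    char buf[512];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        write(STDOUT_FILENO, buf, n);  /* echo the data, like cat */
    return 0;
}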
The root directory can be changed using the chroot system call, which changes the apparent root for the current process and all of its children. Note that if we change our root from / to /usr, we can no longer see files "above us" in the directory tree, and therefore we cannot change the root back upwards. chroot is typically used to create a "chrooted jail" in which we can sandbox potentially dangerous programs; any damage the program does is isolated to the jail, which is the only portion of the file system the process can see.
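Setting up such a jail might look roughly like this (a sketch; the directory name is made up, and as discussed below, this requires superuser privileges):

#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Confine the rest of this process (and its children) to a subtree. */
    if (chroot("/jail") < 0 || chdir("/") < 0)   /* chdir so "." is inside */
        exit(1);
    /* From here on, "/" refers to what used to be /jail. */
    execl("/bin/sh", "sh", (char *)0);           /* i.e., /jail/bin/sh */
    exit(1);
}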
Note that by convention, /.. is equivalent to /; that is, the parent of the root is the root itself. This enforces the inability of a process to navigate upwards beyond the root (as detailed above).
Is there a potential problem with allowing the root directory to be changed? Consider a situation where an attacker wants to become the superuser. He runs su and is given a password prompt:
$ su
password:
One thing the attacker might try is to create the following directory tree structure:
The attacker could alter the contents of etc within playpen, for example by altering passwd to set the root password to whatever he wishes (assuming he salts and hashes it correctly). He could then use chroot to set the root directory to /home/eggert/playpen. A subsequent attempt to gain superuser access with su will then consult this modified password information!
Because of this vulnerability introduced by chroot, there is one more condition: only the superuser can use chroot!
What happens if we have a bad sector? Say the underlying disk has a failure on a specific sector, so the data there is corrupted. If we then try something like open("a/b/c", ...) where b is a symlink whose contents live in the bad block, we have a serious problem.
How should the file system handle this problem? Although the data is not readable, we don't want to crash the entire operating system. Instead, the proper convention is to make the affected I/O system calls fail, setting errno to EIO.
What if we want to corrupt or remove data intentionally (short of melting our hard drive into a ball of scrap metal), for example, before an attacker can get to it? Let's consider a few possible methods to do this:
$ chmod 0 file
Does this work? No. The attacker can still mount the file system himself, run $ chmod 444 file, and have access to the file.
$ rm file
Assume the underlying inode has just one link, and the file is not open anywhere. Does this work? No. Although the blocks containing the sensitive data have been freed and returned to the file system, the data is still on the disk, unaltered, and can be accessed by an attacker.
$ shred file
This is more like it. shred overwrites the file multiple times with random data, then unlinks it, thereby effectively overwriting the original sectors that contained the sensitive data. Why overwrite the file multiple times? With sufficient technology, even after an overwrite, an attacker might be able to recover old data due to magnetic residue on the hard drive ("data remanence"). The more times the file is overwritten, the less likely any data will be recoverable.
Are there any problems with shred? Note that an attacker can still deduce the pattern of holes in the original file, as well as its likely size, by analyzing the shredded blocks. Another potential problem is that the file system might keep additional data about a file elsewhere (including, possibly, an entire backup of the file!), which is unaffected by shredding the original. How do we fix this? We must make sure to shred at the partition level, not the file level, for example: $ shred /dev/dsk/03.
Note that shred is pretty slow... Can we "shred" an entire file system in, say, a millisecond? Yes. Simply start with an encrypted file system, then throw away the key. Without the key, the encrypted data is irrecoverable.