File System Design

Lecture 11 — November 12, 2014

Prepared by Nathan Wong, Janice See, and Jason Zheng

Definition of a File System

A file system is a data structure that represents files (on disk and in RAM) that provides an abstraction representing a collection of files.

If a file system for only one file was to be made, it would be easy to implement. However, a file system for a large collection of files is desired. Essentially, this means a good data structure is needed.

The explanation of file systems will covered over four lectures, each describing an important aspect:

Performance -> Last Lecture
Organization -> This Lecture
Implementation -> Next Lecture
Robustness -> Next Next Lecture

Back to the top!

Storytime with Eggert

As an undergraduate student, Dr. Eggert did research under a quite peculiar and baffling professor. Basically, this professor was crazy.

The professor was unsatisfied with the current UNIX implementation.
The professor wanted to remake UNIX from scratch with a team composed of himself and two undergrads.
He assigned Dr. Eggert to create the file system of this "improved UNIX" OS.

Now Dr. Eggert did not know about UNIX at all during this period of his life. So when he was assigned to create the file system, he created what he thought was the best implementation in his undergraduate mind: a simple implementation.

Dr. Eggert's Dumb File System (1974)

This is a map of the disk using Dr. Eggert's Dumb File System. Each sector takes up 512 bytes. The first sector, split into 16 sections, is reserved as the directory for the files that will occupy the remaining sectors. We start by placing the beginning of each file on a sector boundary. The excess space not within the size allocation is marked as unused space by the OS (marked in green).

Each entry in the directory consists of 32 bytes. The first 16 bytes store the file name. The next 4 store the address of the sector boundary where the file begins (the first sector of the file). The next 4 bytes store the actual size of the file (in bytes). And the rest of the space stores other information, like the number of sectors allocated to the file.

When deleting a file (e.g. file 2), the sectors that it was formerly in need to be marked as free space.

Pros/Cons

SIMPLE
Files are allocated contiguously, and it's easy to code.
FEWER SEEKS
This is true especially if I/O is sequential.

MAX NUMBER OF FILES IS DETERMINED BY THE DIRECTORY SIZE
In fact, we can have at most 256 files with this design.
FILE SIZES (BOUNDS) MUST BE KNOWN IN ADVANCE
This is because everything must be preallocated prior to use. The size can be overshot, but never changed.
EXTERNAL FRAGMENTATION
Free space is available to use, but not for large requests. Small ones are okay, though. In example, a large request may need less space than the total free space available, but there is no one large block available for it to reside in.
INTERNAL FRAGMENTATION
This happens when a file with a size of 0 to 511 bytes is created, leaving extra unused space in between the sectors. It also occurs in allocated space not written to. This is allowed to happen because it is easy to read from sector boundaries. Having multiple files in sectors can cause unintentional overwriting, which is a pain and hassle.

RT-11 File System

Believe it or not, there are several file systems created in likeness of Dr. Eggert's Dumb File System. One of these is the RT-11, a real-time operating system. In certain cases, such file systems are desired because they have consistency and predictability in their timings. Think back to hard real time scheduling (safe and slow). In this case, people are willing to deal with the problems of these kinds of file systems.

Back to the top!

FAT - File Allocation Table

A solution to fix/avoid/lessen the problems (primarily fragmentation) of the contiguous file system was wanted. Enter FAT.

A fancier, more elegant file system than previously created.
No external fragmentation exists.
Small internal fragmentation present.
But this FS gives up performance and simplicity.

This is the new layout of the disk with the FAT structure instead. The block with address 0 is reserved for the boot sector. The block with address 1 is reserved for the superblock, which contains information about the file system including the version number, the size of the file system, the block size, and the location (block number) of the root directory. The metadata section contains the FAT. Its structure is illustrated below.

The FAT is essentially an array containing block numbers. There are three types of values stored in here, namely:

i: index of the next block in the file
null: EOF
-1: free block

The size of each section of the array is related to the number of blocks in data. Because each section is 16 bits, the largest size representing the number of sectors can be represented by 2^16 B. And the size of each sector is 4 KiB, or 2^16 B. Thus, the file system can be at most 2^12 * 2^16 = 2^28 B = 256 MB.

Consideration of Disk Formatting

In RAM, a pointer field (next) could be created to move from one block to another. Why not do this in disk? Why make an array of block numbers?

It would have to scan a lot of data.
The metadata can be hard to get at.
Security could be harder to maintain.
It is more possible for breaches of security to occur with such an implementation.
Programs like to read data in blocks of 2, but user block sizes are not in powers of 2.
It is a hassle to read 4 KiB each time -> Overhead

So FAT says make a linked list:

Payload on the right side.
File data is located in the 4 KiB sectors.
Next fields on the left side.
Array of block numbers essentially replaces the next fields.

Filenames

A question that still remains is: How can filenames and other information be stored?
For FAT, there exists an extra data structure that contains theses things:

The FAT directory entry begins with the file name, which consists of 11 bytes. The first 8 bytes are the actual name of the file and the last 3 bytes are the extension of the file (txt, doc, and others). The period seen in files on the computer like "foo.txt" is not stored. The next section contains the number of the first block index for that entry. Then afterwards comes other information like file size in bytes and option flags.

Regarding FAT directory entries, directories would be considered files as well. The only difference is a filetype flag that helps to determine which of the files are actually directories.

But how can the top level directory be found in this case?
To solve this problem, the root directory number is located in the superblock.

Pros/Cons

NO EXTERNAL FRAGMENTATION
Unlike contiguous allocation, available memory spaces will not be scattered.
FILES CAN GROW DYNAMICALLY
There is no need to preallocate things any longer.

LINKED LIST IMPLEMENTATION
Very slow access, especially with sequential access (seek is needed between each block).
- This can be fixed with a defragmenter.
  - There are numerous reliability issues.
  - Defragmenters are slow and are pigs.
INTERNAL FRAGMENTATION STILL EXISTS
Even with this implementation, this problem is still there. It cannot yet go completely away.

Back to the top!

A Problematic Situation of FAT

Though FAT solved several problems in terms of file systems, it contains difficulties of its own. This can be seen when comparing some seemingly simple commands.

EXAMPLE 1: mv xyz def
For FAT, this is easy to implement. The file is essentially just being renamed within its FAT data entry.

EXAMPLE 2: mv d/xyz e/xyz
At a first glance, this seems quite similar to the first example. In the user's perspective, this should act and be as easy to implement as the first example. However, this is not the case for FAT; this command is harder and causes more trouble. A graphical description of the steps needed to take is shown below:

As shown above, the process of moving between directories is much more complicated than renaming within the same directory. There are two steps the file system must take in order for this to happen:

A. Write to the new block (allocate xyz in e).
B. Rewrite the old block (replace block with zeros in d).

However, which of these two tasks are to be done first? Consider the case in which the computer crashes in the middle of this procedure:

If B is done first, data will be lost.

The computer crashes while data is being removed from one side.
That data being removed was not yet copied over to the other side.
The result is partial, unrecoverable data or a complete loss of data.
This is BAD. This must never happen.

If A is done first, two different directory entries point at the same file.

As an annoyance case, the file can be written to from both directory entries.
Even worse, this can cause an entire file system to turn to mush:

Consider the case where d/xyz and e/xyz both point to the same file.
The file would be deleted from one point (d/xyz) and thus the space would be "set free".
Now this free space can be allocated for other things, because it is considered unused.
HOWEVER, another point (e/xyz) still considers the freed space as its own.
This point (e/xyz) can write into the space where the file used to be while it is being used elsewhere.

Regardless of which action is done first, things can get problematic. So how did the original FAT respond to this situation? It errors out of it (quite the unsatisfactory solution).

How was this problem resolved? Was there another technique to fix it? Look no further than below.

Back to the top!

Inode-based File System

Unix v7 File System (1977)

On disk, there is an inode table that contains a fixed size amount of information about each file. Mainly, it contains the file name, which is 14 bytes, and an inode number, which is 2 bytes. Additionally, it contains information about the owner, time stamps, and size of file in bytes.

In the oldest version, renaming from one directory to another was still not allowed. But consider the following commands:

link("d/xyz", "e/xyz");
unlink("d/xyz");

The first command, link(), creates a hard link e/xyz. In result, there are two different directory entries that both specify the same inode. The second line, unlink(), then removes the hardlink d/xyz so that only one link remains to the file. This is essentially the same as the mv command.

BSD "Fast File System" (~1980)

The first and second sector are reserved for the boot and superblock, respectively. The next section is known as the block bitmap, which contains 1 bit per data block (D/2^16 bytes given D bytes size of data section). Each bit is a 1 if the respective data block is free and 0 if allocated. Afterwards is the inode table, which is shown above. It contains info about the files. And the rest of the space is reserved for data blocks, which are 8 KiB = 2^13 B each.

As it can be seen in the diagram, each section in the inode table contains much information. It first contains the information about the files, described above. Then it has 10 sections that store the inode numbers of the allocated blocks (0 - 9). If more than 10 blocks are allocated to the file, then an indirect block that contains sections that store inode numbers is created, which in this case can store up to 2^11 inode numbers. If that still is not enough sections to store the whole file, a double indirect block which contains sections that point to indirect blocks is created. In this case, a total of 2^11 * 2^11 = 2^22 inode numbers can be stored in the double indirect block. Altogether, the number of inodes that can be stored is 10 + 2^11 + 2^22, which was definitely sufficient for the total size of file systems during that time.

This file system has many similarities to FAT, especially how the inode table essentially replaces the FAT. However, one noticable thing is that to access the block information, the only block bitmap in comparison to a large FAT. Thus, there is less overhead.

Pros/Cons

FINDING FREE SPACE IS FASTER
There only needs to be a check in the block bitmat to see where something can be allocated.
HARD LINKS
This feature proves to be useful in multiple situations, allowing for capabilities that file systems did not use to have.
LSEEK IS FASTER
Rather than O(n), this implementation is O(1), constant regardless of the size of things.

MORE INTERNAL FRAGMENTATION
Blocks can still be not completely used. However, the BSD attempts to fix this problem for small files.

Reclaiming Storage with Link

Q: With link, how can you tell when you can reclaim storage?
A: Using a link count.

fd = open("file", O_RDWR | O_CREAT | ...);
unlink("file");
read(fd, ...);
write(fd, ...); // Lots of writes = disk fills up
fstat(fd, &st);

If st.st_nlink == 0, storage can be reclaimed.
Thus, the following conclusion can be made. Storage for a file can be reclaimed when:
1. Link Count == 0
2. The Last File Descriptor is Closed
The RAM keeps track of which files are open and records them.

Interpreting Filenames

open("a/b/c/d", ...);
namei("a/b/c/d"); // Returns inode number

The function namei() takes a filename and returns the inode number. So how does it work?
The function begins by going through some recursion or loop.

The function starts at the process working directory.
Take the file a/b/c/d for example:

We find a is in the current directory and find it's inode number.
We know a's inode number so finding a/b is next. b is in a.
We know a/b's inode number so finding a/b/c is next. c is in b.
We know a/b/c's inode number so finding a/b/c/d is next. d is in c.

Errors will be outputted if problems arise during the file search:

ENOTDIR if a file with the same name was found, but not a directory of such name.
ENOENT if no entry with such a name exists.

Back to the top!

More About Directory Settings

To change the process working directory, use the following command:

chdir("c");

c is the directory that is to become the working directory (wd).
chdir() changes the process's wd and returns the new wd's inode number.

In Linux, the command to change directories is cd and is used as such: $ cd foo
The first C compiled version of cd is shown below:

// cd.c
int main (int arg, char** argv) {
  ...
  // Create a new process.
  ...
  if (chdir(argv[1]) != 0) {
    ...
  }
  ...
  return 0;
}

However, there was a bug in this original program. This cd would change directories, but only for the child process spawned in the cd.c program. The parent process was left unchanged and thus, the user did not see a directory change.

If the file name is "/a/b/c/d" the / at the beginning of the name means that this location starts at the root directory.

To change the "root", use the following command:

chroot("c");

How is this useful? Consider the case where a user is provided a certain "root", which restricts his or her access, shown in the diagram below:

The smaller triangle represents what the user can access, which makes for a "protected" area outside of the smaller triangle and within the larger one. Thus, the root user can essentially prevent certain areas from being accessed by other users.

However, this command requires root access and is considered a "priviledged" command within the priviledged commands. Why? Consider the case from the diagram above where the smaller triangle is not present. There are two files in "/etc", known as passwd and shadow which contain the password information and can only be accessed by root. If another user was able to use chroot() and change the root directory, a "fake" passwd and shadow could be made within the smaller triangle's /etc that can be accessed by such user. The user can then use the passwords self-created to become a superuser which is VERY BAD.

Back to the top!

Next Lecture

Consider the following diagram:

Can a circular link dependency like this one be allowed? It seems like it would cause some trouble. How can we check for this situation?

Back to the top!

File System Design

Lecture 11 — November 12, 2014

Prepared by Nathan Wong, Janice See, and Jason Zheng

Table of Contents

Definition of a File System

Storytime with Eggert

Dr. Eggert's Dumb File System (1974)

Pros/Cons

RT-11 File System

FAT - File Allocation Table

Consideration of Disk Formatting

Filenames

Pros/Cons

A Problematic Situation of FAT

Inode-based File System

BSD "Fast File System" (~1980)

Pros/Cons

Reclaiming Storage with Link

Interpreting Filenames

More About Directory Settings

Next Lecture