Recall the inode and indirect block structure:
What kind of fragmentation and wasted space can we get with this structure?
If the file size is 0 bytes, we have no fragmentation. Technically, we have 48 bytes of inode space wasted: 40 bytes for the ten direct block numbers (4 bytes each), plus 4 bytes for the indirect block number, plus 4 bytes for the doubly indirect block number, all of which are unused. If the file size is 1 byte, assuming a block size of 8 KiB (8192 bytes), we have 8191 bytes of wasted space in the block allocated for the file, plus 44 wasted bytes in the inode (one direct block number is now in use).
What if we create an empty file, seek very far into it, and write at that location? Consider the following code:
int fd = open("foo", O_CREAT|O_WRONLY|O_TRUNC, 0666);
if (lseek(fd, 1000000000000, SEEK_SET) < 0)
    error();
write(fd, "hello", 5);
This requires the creation of a doubly indirect block.
This is called a "holey file" (a file with holes). Note that the file size is 1000000000005 bytes, although clearly it does not take up that much disk space. In this case, the wasted space is 44 bytes in the inode, 8188 bytes in the doubly indirect block, 8188 bytes in the indirect block, and 8187 bytes in the data block itself (each 8 KiB indirect block holds 2048 four-byte block numbers, only one of which is in use).
Even though the file "appears" to be a trillion bytes in size, if we run the following command:
$ ls -ls foo
We will see that foo doesn't actually use up very much disk space (the -s option asks ls to report how much disk space each file actually uses, in blocks).
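Hypothetically, the output might look something like this (a made-up listing: the 24 blocks of 1 KiB each account for the single data block plus the indirect and doubly indirect blocks, while the size field still reports a trillion-plus bytes):

24 -rw-r--r-- 1 eggert eggert 1000000000005 Jan 20 2015 foo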
What if we try to apply this type of system design to a relational database, for instance?
Note that the database file, which lives a level of abstraction above the underlying file system, appears to contain many zero bytes. However, these are not actually stored on disk (similar to the lseek example above).
What problem can arise due to this? The underlying system can exhaust disk space before the database does. The database sees these zeros as space it already owns and may attempt to overwrite them with real data; but writing into a hole forces the file system to allocate a real block, and that allocation can fail because the disk is actually full, since the zeros were never stored on disk in the first place.
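A minimal C sketch of this failure mode (the file name and offset are made up for illustration):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical sparse database file with a hole at this offset. */
    int fd = open("dbfile", O_WRONLY);
    char block[8192] = {0};
    /* "Overwriting" zeros that were never allocated forces the file system
       to allocate a fresh block, which can fail if the disk is full. */
    if (fd >= 0 && pwrite(fd, block, sizeof block, 1000000000) < 0
        && errno == ENOSPC)
        fprintf(stderr, "disk full, even though the file had 'space'\n");
    return 0;
}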
Another downside to this design is that it is somewhat slower because of non-sequential access: to reach the raw data, the CPU and disk must first traverse the levels of indirection in the data structure.
The following are layers of abstraction in a file system design:
Here are some examples of partitions:
These separate file systems are used to limit the damage caused by filling up or corrupting any one of them. For example, if we completely fill up /usr or /home, that does not mean that /boot is also out of space; if it were, kernel problems would result.
The mount table is the record of active file systems. Here's what it might look like, where the column "file system" denotes which file system the device, inode pair should map to:
dev  ino   file system
---  ----  -----------
1    97    ...
1    1096  ...
3    27    ...
2    35    ...
...  ...   ...
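In C, a mount-table entry might be sketched like this (hypothetical types, not any particular kernel's real structures):

struct file_system;          /* opaque descriptor for a mounted file system */

struct mount_entry {
    int dev;                 /* device holding the mount point directory */
    int ino;                 /* inode number of the mount point directory */
    struct file_system *fs;  /* the file system mounted at that point */
};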
Side note: consider a path "C:\etc\passwd" in Windows. The "C:" part denotes the file system, and the "\etc\passwd" part is the file name. In Unix, however, the mount table takes care of which file system each device, inode pair belongs to. Therefore, the "C:" part is abstracted away from the user (which is why we don't see paths like that).
What is a potential problem with this design? We might have loops in the mount table, creating an infinitely long tree. For example, if our mount table has the following information:
/etc --> file sys 3
/etc/foo --> file sys 4
/etc/foo/bar --> file sys 1
How do we fix this? Easy: the mount system call will only succeed if the target file system is not already mounted. For example, mount("/etc/foo", "/dev/rst/03", ...) fails if /dev/rst/03 is already mounted. An exception to this is the concept of loopback mounts, which allow already-mounted file systems to be mounted again; with loopback mounts, you must be careful about loops.
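As a sketch, with hypothetical helper functions rather than a real kernel API, the rule amounts to:

#include <errno.h>

int already_mounted(const char *device);        /* hypothetical mount-table lookup */
int add_mount_entry(const char *mount_point, const char *device);

int do_mount(const char *mount_point, const char *device)
{
    if (already_mounted(device))
        return -EBUSY;   /* refusing duplicate mounts is what prevents loops */
    return add_mount_entry(mount_point, device);
}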
Note that with this implementation, we can easily support multiple types of file systems at once. For example, /usr could be of type ext4, /home could be of type vfat, and /home/eggert could be of type nfs.
How are all of these different file system implementations supported? By using object-oriented (OO) programming. There is a common interface (API) for each file system implementation; in OO terms, the file system itself is like an object, and the Virtual File System (VFS) layer provides an OO implementation in C. You essentially implement a file system by subclassing the VFS layer, which provides standard operations, such as read/write (basic I/O), unlink, rename (metadata operations), etc.
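For instance, a minimal sketch in C of what such an interface might look like (the names here are hypothetical, not Linux's actual VFS types):

#include <sys/types.h>

struct file_system;  /* opaque per-file-system state ("the object") */

/* Hypothetical table of "virtual methods": each concrete file system
   (ext4, vfat, nfs, ...) supplies its own implementations. */
struct vfs_ops {
    ssize_t (*read)(struct file_system *fs, ino_t ino,
                    void *buf, size_t len, off_t off);
    ssize_t (*write)(struct file_system *fs, ino_t ino,
                     const void *buf, size_t len, off_t off);
    int (*unlink)(struct file_system *fs, const char *name);
    int (*rename)(struct file_system *fs, const char *from, const char *to);
};

struct file_system {
    const struct vfs_ops *ops;  /* the "subclass" chooses its method table */
    void *private_data;         /* superblock, caches, etc. */
};

/* Generic code calls through the table, never a concrete type. */
ssize_t vfs_read(struct file_system *fs, ino_t ino,
                 void *buf, size_t len, off_t off)
{
    return fs->ops->read(fs, ino, buf, len, off);
}

Calling through the function-pointer table is exactly the dynamic dispatch an OO language would provide, which is why the VFS layer is described as OO programming in C.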
As a historical sidenote: VFS object orientation was originally invented for NFS (Network File System) in 1985!
How are file names resolved and paths followed? For example, what happens when a user runs open("abc/def/ghi", ...)? Internally, namei is used to map strings to device, inode pairs. It does a left-to-right traversal over the file name, exploring directories and keeping track of the current directory's inode. For example, after exploring abc/def, the current directory inode is def's inode, and the file name component left to resolve is ghi; namei then resolves that component to a new device, inode pair.
If a file is missing during this traversal process (for example, if def is nowhere to be found once abc is the current directory), then errno is set to ENOENT. If search permissions (the x bit) are missing on a directory to be explored, errno is set to EACCES.
The file name resolution process starts its search in the current working directory. This is a per-process piece of information. If the file name starts with "/", however, we start the search at the root directory. The root directory is also a per-process piece of information.
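Here is a highly simplified sketch of this traversal in C (the inode representation and helper functions are hypothetical, and symbolic links are ignored for now):

#include <errno.h>
#include <string.h>

struct inum { int dev, ino; };  /* a device, inode pair */

/* Hypothetical helpers standing in for real file system internals. */
int dir_lookup(struct inum dir, const char *name, size_t len, struct inum *out);
int has_search_permission(struct inum dir);

/* Simplified namei: resolve `path` starting from `start` (the current
   working directory, or the root if the path began with '/'). */
int namei(struct inum start, const char *path, struct inum *result)
{
    struct inum cur = start;
    while (*path) {
        while (*path == '/')
            path++;                        /* skip separators */
        if (!*path)
            break;
        size_t len = strcspn(path, "/");   /* length of the next component */
        if (!has_search_permission(cur))
            return -EACCES;                /* missing x bit on a directory */
        if (dir_lookup(cur, path, len, &cur) < 0)
            return -ENOENT;                /* component not found */
        path += len;
    }
    *result = cur;
    return 0;
}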
How does file name expansion work on symbolic links? Say we run ls and get the following:
$ ls -l a/b
lrwxrwxrwx 1 eggert eggert 7 Jan 23 2015 b -> foo/bar
Then the file name a/b/c will automatically be resolved to a/foo/bar/c; the link target foo/bar is interpreted relative to a, the directory containing b, and the traversal cursor is switched to the expanded name.
For example, if our file system tree looks like the following, we are still allowed to run $ ln -s /bin/sh /usr/bin/sh (a symbolic link may point to a file that does not exist yet, or ever):
Are there any potential problems with any of these rules about symbolic links? Because symbolic links can link to directories, we can get a loop, leading to an infinitely long file name. For example, if we run $ ln -s ./ ./a/b/c, then the path a/b/c/a/b/c/a/b/c/a/b/c/... is valid and might be traversed forever during file name resolution! To fix this problem, the system places a hard cap on the number of symbolic link expansions allowed before namei fails, returning ELOOP (for example, at most 20 soft links can be traversed to get from the source to the target).
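The cap itself is trivial to sketch; resolution keeps a counter that persists across the whole lookup (the cap's value and the helper are hypothetical):

#include <errno.h>

enum { MAXSYMLINKS = 20 };   /* hypothetical cap on expansions */

/* Called each time resolution encounters a symlink. */
int check_symlink_budget(int *expansions)
{
    if (++*expansions > MAXSYMLINKS)
        return -ELOOP;       /* too many expansions: assume a loop, give up */
    return 0;
}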
Do symbolic links have their own inodes (and thus inode numbers) in the file system? This depends on the implementation.
If YES: there are two plausible implementations. In one, the symlink's inode points to data blocks holding the symlink's contents (the target string). In the other, the target string is stored directly in the inode itself. Either way, whenever the symlink is encountered, a file name expansion takes place, replacing the symlink's file name with the file name reached through its inode.
In this case, note that we can create hard links to symbolic links; two separate directory entries would then contain the same inode number, which is the symbolic link's inode! The following example illustrates this (the 2 is the link count of the symlink's inode):
$ ln -s a b
$ ln b c
$ ls -l
2 ... b -> a
2 ... c -> a
$ mv c /some/other/dir
Note that after the mv, c still points to the name a, which is now interpreted relative to c's new directory, so it may refer to a different file than b does.
If NO: the symlink information must be entirely contained within the directory entry for the symlink. While an ordinary file might have the file name and the inode number in the directory entry, perhaps the symlink would instead have the link name and the string corresponding to the linked (target) file.
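A sketch of what those directory entries might look like in C (hypothetical layouts with arbitrary fixed sizes):

struct dirent_regular {
    char name[256];      /* file name within this directory */
    int  ino;            /* inode number of the underlying file */
};

struct dirent_symlink {
    char name[256];      /* link name within this directory */
    char target[256];    /* target string stored directly in the entry */
};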
As a side note/tangent, while on the topic of file system design: some systems have an optimization that stores very small files in the inode itself, without requiring any additional data blocks. For example, $ ls -ls might show a file length of 7 bytes but 0 blocks of disk space usage, since all 7 bytes of the file's data are contained in the inode.
As usual, there are potential pitfalls and technicalities...
What do permissions on a symbolic link mean? If we run ls, we might see lrwxrwxrwx. The l indicates that the file is a symlink; but what does it mean for the symlink to have specific permissions? Actually, these permissions are junk; they don't mean anything at all. No matter what the permissions say, a symlink can be read but not written to.
What about the owner and group of a symlink? These are mostly junk as well. Anybody can read a symlink, and as noted above, nobody can write to one. However, the owner and group are used for quotas (e.g., the symlink takes up space against the disk quota of whoever owns it).
How does ls know a file is a symbolic link, and how does it know the corresponding character expansion that should take place? The following system calls help out:

lstat, which is just like stat, except that lstat does NOT follow symlinks, while stat does. Therefore, lstat can give us information about the symlink itself, instead of information about the file that it points to.

readlink, which reads the value of a symbolic link (that is, the character expansion). For example: readlink("b", buf, bufsize), where b is a symbolic link.

unlink, which deletes a "name". Similarly to lstat, this does NOT follow symlinks, allowing us to delete a symlink without deleting the underlying file as well.
If we run $ ls -l /dev/null, we will see something like crw-rw-rw- 1, 3 /dev/null. Note that where the size of the file would normally appear, we instead see 1, 3. These numbers are the driver (major) number, 1, and the device (minor) number, 3. Notice that the file type is c, a character special file. b (block special files) and p (named pipes) are some other special file types. Other notable special files include /dev/zero and /dev/full.
Say we accidentally delete /dev/null. This is very bad: programs that write to that location may now instead be writing to an ordinary file on disk, which will hog disk space with junk output that should have been discarded. As an example of special file creation, we can re-create /dev/null like so:
$ mknod /dev/null c 1 3
Also, note that since /dev/null has its own inode, we can link to it:
$ ln /dev/null haha
Another special file is a named pipe (FIFO) file, which behaves like a pipe but has a name that lives in the file system instead of existing only in memory:
$ mkfifo foo
$ ls -l foo
prw-rw-rw- ______ foo
$ cat foo
The command $ cat foo will hang until some other terminal does something like the following:
$ echo blah > foo
At this point, the original terminal will stop hanging, and cat will output blah to the console.
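A minimal C sketch of the reading side, equivalent in spirit to cat foo above:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    mkfifo("foo", 0666);             /* create the named pipe; fails harmlessly
                                        with EEXIST if it already exists */
    int fd = open("foo", O_RDONLY);  /* blocks until some writer opens foo */
    char buf[512];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        write(STDOUT_FILENO, buf, n);  /* echo the data, like cat */
    return 0;
}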
The root directory can be changed using the chroot system call, which changes the apparent root for the current process and all of its children. Note that if we change our root from / to /usr, we can no longer see files "above us" in the directory tree, and therefore we cannot change the root back upwards. chroot is typically used to create a "chrooted jail" in which we can sandbox potentially dangerous programs; any damage the program does is isolated to the jail, which is the only portion of the file system the process can see.
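Setting up such a jail might look roughly like this (a sketch; the directory name is made up, and as discussed below, this requires superuser privileges):

#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Confine the rest of this process (and its children) to a subtree. */
    if (chroot("/jail") < 0 || chdir("/") < 0)   /* chdir so "." is inside */
        exit(1);
    /* From here on, "/" refers to what used to be /jail. */
    execl("/bin/sh", "sh", (char *)0);           /* i.e., /jail/bin/sh */
    exit(1);
}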
Note that by convention, /.. is equivalent to /; that is, the parent of the root is the root itself. This enforces the inability of a process to navigate upwards beyond the root (as detailed above).
Is there a potential problem with allowing the root directory to be changed? Consider a situation where an attacker wants to become the superuser. He runs su and is given a password prompt:
$ su
password:
One thing the attacker might try is to create the following directory tree structure:
The attacker could alter the contents of etc within playpen, for example by altering passwd to set the root password to whatever he wishes (assuming he salts and hashes it correctly). He could then use chroot to set the root directory to /home/eggert/playpen. A subsequent attempt to gain superuser access with su will then consult this modified password information!
Because of this vulnerability introduced by chroot, there is one more condition: only the superuser can use chroot!
What happens if we have a bad sector? Say the underlying disk has a failure on a specific sector, so the data there is corrupted. If we then try something like open("a/b/c", ...) where b is a symlink whose contents live in the bad block, we have a serious problem.
How should the file system handle this problem? Although the data is not readable, we don't want to crash the entire operating system. Instead, the proper convention is to make the affected I/O system calls fail, setting errno to EIO.
What if we want to corrupt or remove data intentionally (short of melting our hard drive into a ball of scrap metal), for example, before an attacker can get to it? Let's consider a few possible methods to do this:
$ chmod 0 file
Does this work? No. The attacker can still mount the file system himself, run $ chmod 444 file, and have access to the file.
$ rm file
Assume the underlying inode has just one link, and the file is not open anywhere. Does this work? No. Although the blocks containing the sensitive data have been freed and returned to the file system, the data is still on the disk, unaltered, and can be accessed by an attacker.
$ shred file
This is more like it. shred overwrites the file multiple times with random data, then unlinks it, thereby effectively overwriting the original sectors that contained the sensitive data. Why overwrite the file multiple times? With sufficient technology, even after an overwrite, an attacker might be able to recover old data due to magnetic residue on the hard drive ("data remanence"). The more times the file is overwritten, the less likely any data will be recoverable.
Are there any problems with shred? Note that an attacker can still deduce the pattern of holes in the original file, as well as its likely size, by analyzing the shredded blocks. Another potential problem is that the file system might keep additional data about a file elsewhere (including, possibly, an entire backup of the file!), which is unaffected by shredding the original. How do we fix this? We must make sure to shred at the partition level, not the file level, for example: $ shred /dev/dsk/03.
Note that shred is pretty slow... Can we "shred" an entire file system in, say, a millisecond? Yes. Simply start with an encrypted file system, then throw away the key. Without the key, the encrypted data is irrecoverable.