Notes prepared by Joachim Valente
Motivation | Symbolic Links | Durability
Consider a virus checker system inside a router running Linux (fig. 1). We would reasonably mount this virus checker as a independent filesystem inside the router's filesystem hierarchy (fig. 2). Since the viruses keep evolving we have to constantly update our system. This update requires a lot of operations: communication with the home base, recompiling the executables, copying files, modifying virus patterns, etc. Furthermore if we unplug the router during the update, we may end up with an inconsistent filesystem. To make this virus checker filesystem robust, we want:
Fig. 1 - A router running a virus checker.
Fig. 2 - Our file system including the virus checker.
To address the first problem let us try the following approach.
curl https://new-version.tar.gz # download the new version (mkdir /new ; tar xf) < new-version.tar.gz # untar it to /new mv /viruschecker /old # rename /viruschecker to /old mv /new /viruschecker # rename /new to /viruschecker
This way we actually do the move in lines 3 and 4. However there is still a major issue with this solution. Between lines 3 and 4, there is no virus checker at all, which obviously needs to be fixed. This turns out to be really challenging if we try to use only the tools we have seen so far.
The solution is to use symbolic links. In short, a symbolic link (also symlink or soft link) is a directory entry whose value is not a file but a filename. To create a symbolic link on Linux we use the ln command (like for hard links) with the -s option. Example code:
$ ln -s foo bar $ ls -l -rw-rw-r-- 1 ownr grp 9268 2013-05-18 19:09 foo lrwxrwxrwx 1 ownr grp 3 2013-05-18 19:10 bar -> foo
Notice the special marker l in the file permission field that stands for "link". Moreover the 777 permission has no special effect because in practice only the target file's permissions matter.
If the target file gets removed, the symbolic link will still exist but point to no existing file. Example:
$ rm foo $ cat bar cat: bar: No such file or directory
In fact, it is not even required at the creation of the symbolic link that the target file exist.
There is a system call to create symbolic links:
symlink("foo", "bar");
Symbolic links give us a solution to our original problem (assuming, of course, the atomicity of symlink). We will have two filesystems, called v1 and v2 (see fig. 3) and a symbolic link called viruschecker at the root level. At first viruschecker links to v1 (the old version), while the virus system gets updated and copied to v2. Finally we just have to make viruschecker point to v2. This last operation is atomic and we have obtained an all-or-nothing update mechanism.
Fig. 3 - Our virus checker implemented using a symbolic link.
There is only one issue with this solution, namely symlinks are read-only. That means we cannot change the file viruschecker points at. However we can instead create a new symbolic link and rename it to viruschecker.
ln -s /v2 /new-viruschecker mv /new-viruschecker /viruschecker
Like any other file, symlinks are traditionally implemented using inodes (fig. 4). A symlink is an inode of special type "l" where the data blocks point to the name of the target file.
Fig. 4 - Inode representation of a symbolic link.
There is a shortcut way, which is reusing the data pointer. Namely instead of pointing to the target filename, the data blocks contain the target filename. These symlinks have the special type "sl".
Finally there is an even shorter way of implementing symbolic links, which is storing their contents directly into a directory entry. However the downside of this method is that we cannot create hard links to a symbolic link.
Symbolic links are a delicate notion and we have to consider several issues. What can go wrong with symlinks?
$ mkdir bar && cd bar $ ln -s .. foo $ find . -name foo ./foo ./foo/bar/foo ./foo/bar/foo/bar/foo ...If we don't do anything about loops, the find command will loop. To prevent that behavior there are several solutions.
lstat("foo", ...);Similarly there is a O_NOFOLLOW option for the open system call:
open(..., O_NOFOLLOW);
ln -s a b ln -s b a cat aInternally cat calls open("a", ...) which obviously will loop forever. To prevent this, the bread crumb technique is too expensive. The solution to this is found in the kernel, that imposes a limit of 20 symlink traversals. In case this number is exceeded, open will fail and set errno to ELOOP. Why 20? Ask Bill Joy.
In our virus checker example, we were worried about power failures. We want our system to be reliable and consistent even if we unplug the machine.
As a reminder these are the goals of a file system.
char buf[1000000000]; fd = open(file, O_WRONLY); fwrite(fd, buf, size_buf); // write one page at a time
What kind of scenario can go wrong? Let us look at this example. We are trying to save a file.
fd = open("file", O_WRONLY); if (write(fd, buf, size_buf) != size_buf) error(); close(fd);
This simple code violates the golden rule of durability: never overwrite the only copy of your data. Instead we should read from file1 and save to a different file2.
Unfortunately this does not completely work. Say we now have a discrepancy between two copies of our data due to a power failure. Which one is the right one?
Fig. 5 - Discrepancy between two copies of the data. Which one is the right one?
One can try the following approach (see fig. 6):
Fig. 6 - Using two copies of the data. This approach is flawed since it is impossible to know which part is garbage and which part is not.
Unfortunately this does not work because if the reboot happens during an overwrite, we have no way of knowing if the first copy is corrupted or if the second copy if corrupted. However, by extending this scheme to three copies of the data, we can obtain something that works (fig. 7).
Of course this requires three times as much memory, and we do not want to burn two thirds of our disk storage for this purpose. Luckily we can impove this method, for example using commit records. However for these methods to be reliable, we must add some classical assumptions, like the existence of a checksum that can detect errors and even recover small errors, or the atomicity of sector writes, which are actually part of the so-called Lampson-Sturgir assumptions:
Fig. 7 - Using three copies of the data. This approach works... but requires three times as much memory. To choose a non-corrupted copy, the rule is choose the majority. In the only case where there are three different versions (middle case), choosing the top one allows us to retrieve the updated correct version.
Non-atomicity can lead to disasters. Let's consider this example:
rename("a", "b");
What is really happening is this sequence of operations:
In this example a simple rename actually leads to multiple atomic operations. We can make worse:
rename("a", "bbbbbbbbbbbbbbbbb");
Here we will have to deeply change the contents of the inode because of the filename length. Even worse:
rename("a/b", "c/d");
Here we have to modify two directories. Let us look at what can go wrong in this last example.
Case 1: update a then c. The following will happen:
If the power is cut between 2 and 3, then the file will be marked as referenced by nobody, and therefore will eventually be deleted and lost. This is, of course, unacceptable.
Case 2: update c then a. The following will happen:
If the power is cut between 3 and 4, the inode will be marked as referenced by two files, even though there is really only one hard link to it. Therefore if we later remove this hard link, the storage should be freed but will not because of this error in the link counter (note: the command fsck can fix that). What we have here is a memory leakage. This is not desirable; however this is not a disaster! Therefore this solution is preferred.
More generally, we want to make sure that file system invariants are conserved after an operation. File system invariants include:
The violation of the first three invariants would lead to disasters (corrupted filesystem or lost data). However the violation of the fourth one leads to a leak, which is not as bad.