We can decompose FS robustness into subgoals:
Durability - Survives (limited) hardware failures. E.g. can survive power loss, but cannot survive computer melting.
Atomicity - Changes are done or not done.
Performance - Throughput and latency.
We focus on the "lose power" aspect since it is a simple problem. We need to deal with losing contents of RAM, CPU/controller caches, and registers. We only keep what was written to the disk.
Suppose you're writing to a disk, you only get a few microseconds to:
have the CPU execute a few instructions and
make the disk controller park its head.
We'd like to assume that sectors writes are atomic for the sake of this example.
However, it is important to remember that sector writes are not always atomic. For example, according to Zheng et al. FAST '13 "Understanding the robustness of SSD's under power fault," out of 15 SSD's, only 2 worked after power loss during operation. The other 13 suffered from:
Suppose we're trying to implement ^X^S (save) in emacs.
write(fd, buf, sector size)
We don't have an operating system for now. If we lose power here, we can lose data. To try circumvent this issue, suppose we have 2 devices.
Golden Rule of Atomicity: Never overwrite your only copy.
We have two blocks, A and B. We write A first with B's data, then overwrite B's data.
One problem with this is that we bloat data by 2x. The second, more significant problem is that if we lose power right before we overwrite B's data, we do not know whether A or B's data is correct:
The solution to this is to do three writes. Then if we have a power loss, we take the 2/3 majority that are the same. If A, B, C are all different, we take A as the correct bit.
Doing three writes solves bit corruption. Writes are not atomic, but they eventually finish since we know which bit is correct.
We use Eggert's Assumption to simply things for us.
if (lseek(fd, 0, SEEKSET) % 8192 == 0)
write(fd, buf, 8192); // assume already allocated
// assume BSD style file system
We update the time stamp first, then the data. This will cause 'make' to do extra work, but it will still succeed.
A common optimization is to do the write, dally, and then timestamp the file. This, however, has the issue of breaking 'make' since it depends on updated timestamps. The solution to this is to use a cleanup procedure following a reboot, i.e. 'make clean'.
rename("d/a", "d/b")
rename("d/a", "d/bb")
For the first operation, we find d's inode, then we find d's data and change the "a" to "b".
The second operation can be more difficult because it requires dealing with block allocation, as the file names are of different sizes.
rename("d/a", "e/b")
A problem that arises here is that when we write first, the link count is 1, although there are now two links pointing to the inode.
To fix this:
This way, the link count never under underestimates the true value. It can, however, overestimate, but this is better. This way we may lose disk space but cannot lose data.
It is important to have a model for invariants for your file system. An invariant is a condition that should always be true. Otherwise there can be serious consequences, like data loss.
Example of invariants for a BSD-style file system include:
The fourth invariant is the least important and not really an invariant since its consequence isn't too serious. Other invariants exist as well and vary for different file systems.
BEGIN
pre-commit phase // work is being done, but you can back out
COMMIT or ABORT // atomic at lower level
post-commit phase // work is being done, but cannot back out
END
This process appears to be atomic to the user.
It is possible to have only a journal and no cell blocks. This allows very fast writes and slow reads. Some cell storage can be cached to RAM
When the journal runs out, it can start over at the beginning of the jornal. This requires garbage collecting the journal as only some of the early parts of the journal can be used.
ext4 is an example of a journaled file system.
When replaying logs after a crash, for writeahead logs replay left to right. For writebehind logs, replay right to left.