Presented by Paul Eggert on February 25, 2015
Scribe notes written by Ryan Voong and Frank Wang
While there are many invariants, in this class we focus on the following four conditions:
Obviously, violating of invariant 4 is the least harmful. When writing file system code under a strict time limit, it is important to prioritize.
One way to think of how a file system works: a data structure partially on disk or flash, and partially on RAM. We can think of RAM being a cache for what's on disk. Suppose process 1 does a write, and 0.0002 seconds later, process 2 does a read. The read occurs in the kernel and looks at the cache instead of the disk.
Dallying is basically delaying a request in hope that we can batch it future requests. For file systems, it occurs when the user does many writes, but the file system doesn't save data on disk or flash until a few milliseconds later because perhaps another write occurs. The system then tries to write them all at once. This can improve performance by delaying requests that, in the future, may not even be necessary.
Batching is combining several operations into one to reduce setup overhead. For file systems, this occurs when the user writes things one byte at a time, but leaves things sitting in the cache.
If we lose power, the write just doesn't happen because RAM is volatile. We can lose data that's written just before a crash as a result of tricks that supposedly improve performance. To some extent, this problem is not a big deal. How would we know the write doesn't get done? As long as the write success is not executed, then this may be acceptable.
Suppose the write success is visible to the user. Data that is stored on disk may not reflect the sequence of actions that the user did. Some side effects may not be saved due to out of order writes. One way to fix this issue is to complicate the API, using some new system calls:
int fsync(int fd) /* takes all of the data sitting cached and makes sure that on disk, all cache for this file has been written */
However, there is an issue with this: if we lose power during the fsync
call, then the result is a bad disk (specifically I/O failure). So let's try something else:
if (fsync(fd) != 0)) error();
if (write(fd, buf, 8192) != 8192) error();
if (fsync(fd != 0)) error();
For this implementation, we need 2 blocks. In the inode table, there is a block containing the inode for our file, and we need to change mtime
and size
. To fix this, we introduce another system call, int fdatasync(int)
. In this case, if all we've done is change the last modified time, we can use fdatasync
instead of the second fsync
. This will be twice as fast because you don't have to move between the inode table and the block.
One misconception is that the system call int close(fd)
will synchronize data to the disk. However, closing the file with this system call is completely unrelated to saving data on disk. It is not guaranteed to write cache to disk or flash.
What if we want to rename files? Let's consider rename("a", "b")
, which renames a file a
to b
.
"a"
by loading the directory data into RAM. Then, it changes "a"
into "b"
in the directory's data.If we use this approach, and there are multiple renames, we could encounter problems. In the event we lose power in the middle, it's possible that some files get renamed, and some don't. This is because we may be able to write out one cache entry, but not the others. This means that after reboot, we may lose some of the renames, but also have some later renames after the ones we lost succeed. But what if want the first rename to hit the disk?
Intuitively, one might try to use the sync()
function; this won't work because sync()
only schedules. Meanwhile, the other functions we learned about, like fdatasync(int fd)
and fsync(int fd)
require file descriptors. The problem at hand is that if you do renames in a directory, you want to make sure that the change in directory is actually in disk.
rename("x/a", "x/b");
dfd = open("x", O_RDLONLY);
fdatasync(dfd); //synchronizes directory
Unfortunately, a serious problem can occur when we rename from one directory to another. If the power goes out at an inopportune time, we can have a situation where two different directory entries are pointing at the same file, but the file only has a link count of 1. Of course, this can cause dangling pointers upon deletion. So what do we do if we want to rename a file and move it to a different directory? Let's consider rename("x/a", "y/b")
, which renames a file a
to b
and moves it from directory x
to y
. It needs to take steps to make sure that the file system's invariants are not violated if we lose power. Our approach is as follows.
x
's data.y
's data.y
's new data with a pointer to the file.x
's new data.This implementation does not preserve all invariants, but also will not cause a crash. At some point, the recorded link count is 2 while the actual link count is 1. There may be issues if the system loses power at that point.
fsck
is a tool that checks for errors in a file system. It works by:
/lost+found
directoryIt may be run only on unmounted file systems because all of the bytes are static and unchanging while fsck
is reading them. It is common to run fsck
after a reboot.
Suppose we have a set of pending I/O requests. We want to read block 932 and write block 49171631. We can use many different scheduling algorithms, such as SSTF (shortest seek time first) or FIFO (first in, first out).
If we use FIFO, we will have nice properties for file system robustness, but performance will suffer. Suppose we have a careful ordering of I/O:
write 39216
write 2196 // inode link count = 2
write 3721
write 2196 // inode link count = 1
If we use SSTF, this is what will happen:
write 2196 // the duplicate write has been removed, which is bad!
write 3721
write 39216
We want to find a compromise that has most of the performance of SSTF, but not the correctness disaster. In this compromise,
fsync
or fdatasync
)This compromise is used by BSD file systems.
There are three main goals to keep in mind when designing a file system:
rename ("a", "b")
, you want the file to be either renamed or not renamed. There shouldn't be two links to the same file or no links.Assume we're implementing it atop an underlying device where writes aren't atomic.
We have two ways to write:
nawrite("B"); // nonatomic write
awrite("B"); // atomic write
This is an implementation of awrite
that does not work:
void awrite(char x) {
// lock out all readers
write(x);
unlock(...);
} // THIS DOES NOT WORK
The golden rule of atomicity: when you write, never overwrite the last copy of your data.
Instead of having only one disk drive, have a device with multiple blocks or drives and keep two copies of the data. Now we must do two writes:
write(I, "B");
write(II, "B");
We always want to keep an old version of the data. After a reboot, we always use the newest version of the file. However, this presents a problem: we may lose data!
Instead, we can keep a third copy of the data to ensure that never lose data. But this presents an additional problem: we've tripled our required disk space.
These ideas will help us build a file system with good performance, atomicity, and durability.