By Charley Chen and Richard Souleles
Invariants are boolean expressions that should always be true. For example 0 < 1.
1. Each block is used for exactly one purpose.
- Such as superblock, bitmap block, inode block, data block
2. All reference blocks are initialized to data appropriate for that block.
- Inode entries should not have negative file sizes
3. All referenced data blocks are marked as used in the bitmap.
4. All unreferenced data blocks are marked as free in the bitmap.
There are more invariants in addition to these four. For example: If an inode is not being used, it should be marked as such.
If invariants are violated, code is buggy
Violating file system invariants could lead to disaster. If we violate the following invariants:
Invariant 1- Using the same block for two purposes or overwriting the bitmap leads to lost data.
Invariant 2- We might overwrite data or crash if we follow uninitialized pointers
Invariant 3- We could overwrite directory blocks or used blocks and lose data.
Invariant 4- Waste alot of space. Memory leaks.
Our system can improve performance by utilizing RAM. It is faster to read from RAM than from disk, and we can think of RAM as a cache for the disk controller.
Batching is performing several requests as a group to avoid the setup overhead of doing them one at a time. For example, the user writes a byte at a time into the cache and eventually the collected data is written out to disk or flash.
Dallying is delaying a request in hopes that the operation will not be needed later, or that more opportunities are created for batching. The user can delay a write request so that when a second write to the same block comes along, the user can delete the first request and perform just the second one. This approach is sometimes called write absorption .
However, data written to RAM can be lost if there is a power failure or crash and the data has not been saved to disk or flash.
Some operating systems may utilize circular elevator scheduling for write requests. This improves performance, but if there is a crash or power failure the data on disk may not reflect the actual sequence of actions that user did. Since there may have been out of order writes some writes or side effects may not be saved. One solution: we can complicate the API and allow the user to tell the operating system when to save writes in cache to disk. These are three flavors of doing this available to the user:
int fsync(int fd)
- Write all the cached data for fd into disk and return an integer to notify the user if the save worked
- Slow since we need to wait for all of the data to be saved to disk
void sync(void)
- Schedules writes for all caches and returns immediately
- Important to note that when sync returns data has not been saved
int fdatasync(int fd)
- Works just like fsync but does not modify inode table if file size has not changed
- Runs twice as fast as fysnc since it only does one disk write for the data and does not update the file's inode
Important note: The system call close has nothing to do with saving data on disk since it does not write cache to disk or flash. There are no guarantees that the data will be saved after calls to close.
rename("a", "b");
1. Get working directory's inode and read it into RAM
2. Get directories' data into RAM, change the file name from a to b in cache
3. Eventually cache written to disk
rename("x/a" , "y/b");
0. Read directory inodes into cache
1. Read directory x's data into cache
2. Read y's data into cache
3. Increment file inode link count to 2 and write to disk
4. Update directory y and write its data to disk
5. Update directory x and write its new data
6. Change file's link count back to 1 and write inode again
An engineer at Bell Labs got really tired of incorrect link counts in file systems so he wrote a program called File System Check
fsck -run on an umounted system. Reads bytes off the disk and checks link counts. It places orphaned files, files with a non zero link count and nothing pointing to it, into the lost+found directory.
Say there is a set of pending I/O requests, and you need careful ordering, for example:
Write 39216
Write 2196 link count 2
Write 37215
Write 2196 link count 1
With some forms of scheduling, such as SSTF, the computer could see that there is two writes to the same location and ignore the first one, and skip to the second to save time. However, we need that first write to occur incase we lose power at that moment.
There is a compromise (BSD File System) with most of the performance of SSTF.
1. OK to reorder data blocks (assume no fsync or fdatasync)
2. Remember dependencies in non-data blocks, low level system must
respect dependencies
Three main goals of a file system:
1. DURABILITY survive failures in underlying hardware
2. ATOMICITY changes are either all made or not made at all
3. PERFORMANCE throughput + latency
To implement atomicity atop a device where writes are not atomic, we need to have three copies of the file. Say we are changing the contents from A to B, the three files would change as shown below.
1. A ? B B B B B
2. A A A ? B B B
3. A A A A A ? B
If the computer crashes or loses power in the middle of the write, you keep the majority. In the case that all three differ, you go with the first one, in this case B.
GOLDEN RULE OF ATOMICITY: Never overwrite last copy of your data
1. Storage writes may fail or may corrupt other blocks.
2. A later read can detect the bad block
3. Storage can spontaneously decay
4. Assume errors are rare
5. Repairs can be done in time
Number 1
Commit Record:
Record old versions of changes and save new ones when you make them.
Number 2
Journal:
Write planned changes so that if power goes out, it is easy to check if changes were made or not.