We can expect that things will go wrong in our system; it is up to us to handle these cases accordingly. All systems are built out of components, and each of these components is a system of its own.
Depending on the assumptions we make about the components of our system, there are two ways to accomplish reliability:
Let's say we get a spontaneous error in our disk controller cache. We read a block into the cache and the read operation completes perfectly; however, by the time we go to retrieve that block from the cache later, its contents have been corrupted.
Reliability applies to anything that you're building. Google succeeded early on by choosing method 2. They bought the cheapest file servers possible with zero reliability checking, and then wrote software that would perform error checking. They saved a lot of money this way.
The method depends on what resources you have available and what your end goal is.
Problems that occur in our system can be categorized into external and internal events:
Gamma rays can pass through our components and set or unset arbitrary bits within our system.
Servers typically have ECC (error-correcting code) memory since they need to run for extended periods of time. ECC adds around $5 to the cost of RAM, and most people don't want to pay that much extra for laptop or desktop RAM.
Assuming a power failure occurs, we can deal with it using our reliability methods.
We can get a better power supply, or use an uninterruptible power supply (UPS). In reality, there is no such thing as a truly uninterruptible power supply: it is just a backup battery, and it can fail too.
We can build a system that is reliable in overall behavior even if individual components fail, for example in the presence of power failure. In particular, we can implement our file system such that a power loss does not result in data loss (the data after reboot equals the data just before the crash).
We have a simple text file containing the grades of all CS111 students: cs111grad.txt.
The file is of fixed size: 20000 bytes.
We always read the entire file or write the entire file using the following functions:
read(fd, buf, 20000);  // Loads the file's data into the grading application. Like a LOAD (LD) operation.
write(fd, buf, 20000); // Writes the application's data back out to the file. Like a SAVE operation.
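As a rough sketch of how the grading application might wrap these calls (the helper names and error handling here are for illustration only, not part of the lecture):

/* Sketch: load and save the entire 20000-byte grades file in one call.
   The helper names are illustrative, not part of the lecture. */
#include <fcntl.h>
#include <unistd.h>

#define GRADES_SIZE 20000

/* Load the whole file into buf; returns 0 on success, -1 on error. */
int load_grades(char buf[GRADES_SIZE]) {
    int fd = open("cs111grad.txt", O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, GRADES_SIZE);   /* the "LD" step */
    close(fd);
    return n == GRADES_SIZE ? 0 : -1;
}

/* Write the whole buffer back out; returns 0 on success, -1 on error. */
int save_grades(const char buf[GRADES_SIZE]) {
    int fd = open("cs111grad.txt", O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, GRADES_SIZE);  /* the "SAVE" step */
    close(fd);
    return n == GRADES_SIZE ? 0 : -1;
}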
The implementation for read is straightforward; the implementation for write is not. Recall the inode structure of a file in Linux: a fixed set of direct block pointers, plus an indirect block pointer and a doubly indirect block pointer for larger files.
We cannot assume that writes are atomic. When we issue a write request, it may be split into chunks of 512 B, 1024 B, etc., depending on the behavior of the disk controller.
We have some consistency issues: 1) data vs metadata, and 2) data vs data (different parts).
1) If our power goes out after we have finished writing to our data blocks, but before we finish writing to our metadata, the file will look like it was not modified even though it was.
2) If the power goes out in the middle of writing to a data block, we might not allocate the necessary indirect or doubly indirect block.
Never overwrite your only copy unless your underlying write operation is known to be atomic.
So how do we update something when the underlying system is not atomic?
We can change our application so it maintains two copies of the file: a main copy and a temp copy.
We will overwrite the temporary copy first, and then we will overwrite the main copy.
If the two copies disagree, we could say that MAIN.txt is the correct version. However, this is wrong: there is no way to know whether MAIN.txt is intact once it has started being overwritten.
If two didn't work...
Using three files will work. We look at all three copies and select the version that the majority agrees on (two out of three match). If all three disagree, we pick TMP1.txt; this works because the copies are overwritten in order, so all three can only disagree if the crash happened while writing the second copy, which means TMP1.txt had already been written completely. We can then pick up where we left off on the next read.
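Here is a minimal sketch of that recovery decision, assuming all three copies have already been loaded into memory (the function and parameter names are illustrative):

/* Sketch: the three-copy recovery rule. Assumes all three copies have
   already been loaded into memory; copy 1 is the one written first. */
#include <string.h>

#define GRADES_SIZE 20000

const char *pick_copy(const char c1[GRADES_SIZE],
                      const char c2[GRADES_SIZE],
                      const char c3[GRADES_SIZE]) {
    if (memcmp(c1, c2, GRADES_SIZE) == 0) return c1;  /* copies 1 and 2 agree */
    if (memcmp(c2, c3, GRADES_SIZE) == 0) return c2;  /* copies 2 and 3 agree */
    if (memcmp(c1, c3, GRADES_SIZE) == 0) return c1;  /* copies 1 and 3 agree */
    /* All three disagree: the crash must have hit while writing copy 2,
       so copy 1 (TMP1.txt) was already written completely. */
    return c1;
}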
But this is too slow and it eats up too much space!
Instead of using a third copy of the original file, we will use a COMMIT block, which keeps track of which copy is the correct one to reference.
If the commit record is 0: look at MAIN.txt.
If it is 1: look at TMP.txt.
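A small sketch of the read side, assuming the commit record is a single byte stored in its own file named COMMIT (the storage format is an assumption, not something specified in the lecture):

/* Sketch: decide which copy to read based on the commit record.
   Storing the record as one byte in a file named COMMIT is an assumption. */
#include <fcntl.h>
#include <unistd.h>

const char *copy_to_read(void) {
    char commit = 0;
    int fd = open("COMMIT", O_RDONLY);
    if (fd >= 0) {
        if (read(fd, &commit, 1) != 1)
            commit = 0;               /* unreadable record: fall back to MAIN */
        close(fd);
    }
    return commit == 0 ? "MAIN.txt" : "TMP.txt";
}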
When we want to preserve atomicity in a system, we make several assumptions about our system.
Based on these principles, we have a generic algorithm for solving atomicity for high-level operations:
BEGIN
- Pre-commit phase: low-level operations that may or may not be atomic. At this stage, the high-level operation has not yet happened. We can prepare for the higher-level operation, or back out of these operations and leave no trace of the higher-level operation. (ex: while COMM is 0.)
- COMMIT: an atomic operation at the lower level. It tells us to begin overwriting MAIN; this is the point at which the higher-level operation takes effect. (ex: the first two 1s.)
- Post-commit phase: the high-level operation has already occurred, so we cannot change our minds, but we can improve performance for the next time around. (ex: COMM's last 1.)
DONE
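Applied to the MAIN/TMP example, the phases might map onto the update path like this (the helper names overwrite_copy and set_commit_record are assumptions for illustration, not a real API):

/* Sketch: the generic BEGIN / COMMIT / DONE pattern applied to the
   two-copy update. The helper names are assumptions, not a real API. */
#define GRADES_SIZE 20000

void overwrite_copy(const char *path, const char buf[GRADES_SIZE]); /* rewrite one copy */
void set_commit_record(char value);  /* flip the 1-byte record; assumed atomic */

void write_grades_atomically(const char buf[GRADES_SIZE]) {
    /* BEGIN / pre-commit: write the temporary copy. A crash here leaves the
       commit record at 0, so readers keep using MAIN.txt and the
       half-finished update leaves no visible trace. */
    overwrite_copy("TMP.txt", buf);

    /* COMMIT: one small write that is atomic at the lower level. After this
       point the high-level write has logically happened. */
    set_commit_record(1);

    /* Post-commit: we cannot back out any more, but we can tidy up so the
       next update starts from a clean state. */
    overwrite_copy("MAIN.txt", buf);
    set_commit_record(0);  /* DONE */
}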
Journaling was originally borrowed from double-entry bookkeeping in accounting. The idea is that we never erase any data (the log is append-only) and that we log all of our operations.
The purpose of the journal is to emulate the effect of atomic writes even if the underlying system does not support atomic writes. We can accomplish this by combining the journal with the cell storage.
Our journal will hold the following information:
One way to accomplish this is through a simple logging protocol, called a write-ahead log:
Suppose we are trying to change an A in the cell to B:
Note: if you look at the most recent entry in the journal, it describes a value that is not in cell storage yet.
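Here is a minimal sketch of that protocol over an in-memory model of the journal and cell storage (the record types and struct layout are assumptions made for illustration):

/* Sketch: write-ahead logging over an in-memory model of the journal and
   cell storage. The record types and layout are assumptions for illustration;
   no bounds checking is done on the journal. */
enum rec_type { CHANGE, COMMIT, DONE };

struct journal_entry {
    enum rec_type type;
    int cell;        /* which cell is being changed    */
    char new_value;  /* the value we intend to install */
};

#define JOURNAL_CAP 1024
static struct journal_entry journal[JOURNAL_CAP];
static int journal_len = 0;
static char cell_storage[16] = "AAAAAAAAAAAAAAA";

static void append(enum rec_type type, int cell, char value) {
    journal[journal_len++] = (struct journal_entry){ type, cell, value };
}

/* Change cell i to value v (e.g. 'A' -> 'B'). */
void wal_write(int i, char v) {
    append(CHANGE, i, v);   /* 1. log the planned new value first          */
    append(COMMIT, i, v);   /* 2. commit: the change now officially counts */
    cell_storage[i] = v;    /* 3. install the value into cell storage      */
    append(DONE, i, v);     /* 4. record that the installation finished    */
}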
In the case of a power failure, we scan the journal left to right looking for COMMITs that aren't DONE.
We then apply the necessary changes one by one for unfinished writes.
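Continuing the sketch above, recovery replays every committed change whose DONE record is missing:

/* Sketch: redo recovery over the journal from the sketch above. Scan left
   to right; any COMMIT without a matching DONE gets its value re-installed. */
void wal_recover(void) {
    for (int i = 0; i < journal_len; i++) {
        if (journal[i].type != COMMIT)
            continue;
        int finished = 0;
        for (int j = i + 1; j < journal_len; j++) {
            if (journal[j].type == DONE && journal[j].cell == journal[i].cell) {
                finished = 1;
                break;
            }
        }
        if (!finished)
            cell_storage[journal[i].cell] = journal[i].new_value;  /* redo it */
    }
}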
For larger file systems, RAM acts as a cache of the actual cell storage. This approach is used even more aggressively in databases: main-memory databases keep their data in RAM rather than on disk so as not to incur a large number of disk I/O operations.
The write-behind log reverses the order:
We scan backwards through the log, looking for uncommitted operations, undoing as we go.
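A matching sketch for the write-behind variant, reusing cell_storage and JOURNAL_CAP from the write-ahead sketch (again, the record layout is an assumption): the old value is logged, the cell is overwritten, and only then is a commit record appended, so recovery undoes anything that never committed.

/* Sketch: a write-behind (undo) log. */
enum wb_type { OLD_VALUE, WB_COMMIT };

struct wb_entry {
    enum wb_type type;
    int cell;
    char old_value;  /* only meaningful for OLD_VALUE records */
};

static struct wb_entry wb_journal[JOURNAL_CAP];
static int wb_len = 0;

void wb_write(int i, char v) {
    wb_journal[wb_len++] = (struct wb_entry){ OLD_VALUE, i, cell_storage[i] };
    cell_storage[i] = v;                                          /* install first */
    wb_journal[wb_len++] = (struct wb_entry){ WB_COMMIT, i, 0 };  /* then commit   */
}

/* Recovery: scan backwards, undoing any change that never committed. */
void wb_recover(void) {
    int committed[16] = {0};            /* one flag per cell of cell_storage */
    for (int i = wb_len - 1; i >= 0; i--) {
        if (wb_journal[i].type == WB_COMMIT)
            committed[wb_journal[i].cell] = 1;
        else if (!committed[wb_journal[i].cell])
            cell_storage[wb_journal[i].cell] = wb_journal[i].old_value;  /* undo */
    }
}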
Pros:
Cons:
We will implement the algorithm for a rename, say rename("d/f", "e/g"), as follows:
1) Get d's inode into RAM (assume d's inode points directly to the data)
2) Get d's data
3) Get e's inode (make sure it's a directory)
4) Get e's data, because there are other directory entries that we have to write out
5) Update d's data in memory
6) Do the same for e
7) Write out d's data
8) Write out e's data
Assuming we crash in between steps 7 and 8, we run into a problem:
We've deleted d/f, but we haven't created e/g: we've lost our original file.
This is easily solved if we write out e's data first and then d's data afterwards.
But what if we have two directory entries to a file whose link count is 1?
This may result in a dangling hard link: two directory entries point to the same file, but the inode only records one link. If we remove one of the entries, the link count drops to zero and the file's storage can be reclaimed even though another entry still points to it.
We can fix this by adjusting our previous algorithm:
1) Load f's inode into RAM
2) Get d's inode into RAM (assume d's inode points directly to the data)
3) Get d's data
4) Get e's inode (make sure it's a directory)
5) Get e's data, because there are other directory entries that we have to write out
6) Update d's data in memory
7) Do the same for e
8) Write out f's inode with the link count bumped to 2
9) Write out e's data
10) Write out d's data
11) Write out f's inode again to fix the link count back to 1
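A toy in-memory sketch of that ordering (the structs and "write to disk" helpers are stand-ins, not real kernel code); the comments note what a crash at each point would leave on disk:

/* Sketch: the ordering above on a toy in-memory model. The invariant is that
   the on-disk link count is never smaller than the number of directory
   entries pointing at the inode. */
#include <string.h>

struct inode_t   { int link_count; };
struct dir_entry { char name[16]; struct inode_t *ino; };

void write_inode_to_disk(struct inode_t *ino);         /* hypothetical */
void write_dir_to_disk(struct dir_entry *dir, int n);  /* hypothetical */

/* rename("d/f", "e/g"): d has d_n entries (d/f assumed to be the last one),
   e has e_n entries, and f currently has link count 1. */
void rename_f(struct dir_entry *d, int d_n,
              struct dir_entry *e, int e_n, struct inode_t *f) {
    f->link_count = 2;              /* over-count before touching any directory */
    write_inode_to_disk(f);         /* crash here: count 2, 1 entry (harmless)   */

    strcpy(e[e_n].name, "g");       /* add the new entry e/g first               */
    e[e_n].ino = f;
    write_dir_to_disk(e, e_n + 1);  /* crash here: count 2, 2 entries (fine)     */

    memset(&d[d_n - 1], 0, sizeof d[d_n - 1]);  /* now drop the old entry d/f    */
    write_dir_to_disk(d, d_n - 1);  /* crash here: count 2, 1 entry (space leak) */

    f->link_count = 1;              /* finally fix the count                     */
    write_inode_to_disk(f);
}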
Now we won't have inconsistencies in the file system that cause crashes later. We can still end up with a link count of 2 while there is only 1 hard link. That is still a problem, because when we remove the file its space will not be reclaimed, causing a space leak. The system will no longer crash due to such an error, though; to account for the leak, we can always run a file system checker on the disk later.