Central Concern with File Robustness
What happens to our data if we are in the middle of writing to our storage disk when we get a power failure?
It is important that when our system reboots we don't have scrambled data in the region we were writing, with no idea of whether the write finished or not. Ideally, once our system is back up and running, every write will appear either to have been applied completely or not to have happened at all. We call such changes atomic, meaning they happen fully or don't happen at all. We explore journaling as a means to improve file system robustness by adding atomicity to file writes as well as a history of the state of the file system.
What is the journal?
A circular buffer containing the data we want to write, along with commit records indicating whether those writes have successfully completed.
Journaling Protocol
- Log the data to be updated in the journal before physically storing it in cell memory
- Write a commit log to the journal indicating that our previous entries (the data we want to write) are complete
- Install the changes to the physical cell memory
- Log that the entire process is complete
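To make the four steps concrete, here is a minimal C-style sketch of the protocol. The helpers journal_append(), journal_commit(), install_blocks(), and journal_mark_done() are hypothetical names standing in for the real file system code, not an actual API.

/* Sketch only: the journal helpers below are hypothetical. */
void journaled_write(struct journal *j, int req_id,
                     const char *data, size_t len, long target_block)
{
    journal_append(j, req_id, target_block, data, len);  /* 1. log the new data      */
    journal_commit(j, req_id);                           /* 2. log the commit record */
    install_blocks(target_block, data, len);             /* 3. update cell memory    */
    journal_mark_done(j, req_id);                        /* 4. log that we are done  */
}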
If we lose power in the middle of a write, we will have information in the journal about what we were trying to do and whether or not it was completed. The following table summarizes the result of recovery if we fail at various points in the process.
| Step in Process | State of the Write | Recovery Process |
| --- | --- | --- |
| 1 | Data may be partially logged to the journal; no commit record yet | We won't see the commit from step (2), so we know we need to re-collect the data that was supposed to be written. The data in cell memory will not have changed. |
| 2 | Data and commit record are in the journal; nothing installed in cell memory yet | We will know that the data to be written is all in the journal but has not yet been physically written to cell memory. We can install it and finish the journaling protocol for this write immediately. |
| 3 | Install to cell memory may be partial | Upon reboot we might have garbage in cell memory, but all the data to be written (as confirmed by the commit from step (2)) is stored in the journal and can be installed immediately. |
| 4 | Entire write installed and logged as complete | From the journal we will see that the entire write completed successfully, and there is no more work to do for this write. |
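A matching sketch of the recovery pass implied by the table, again with hypothetical structures and helpers: on reboot we scan the journal, skip uncommitted entries (case 1) and fully installed ones (case 4), and replay anything committed but not yet marked installed (cases 2 and 3).

/* Sketch only: journal_entry, journal_first/next, install_blocks, and
   journal_mark_done are hypothetical. */
void recover_from_journal(struct journal *j)
{
    for (struct journal_entry *e = journal_first(j); e != NULL; e = journal_next(j, e)) {
        if (!e->committed)
            continue;                               /* case 1: cell memory untouched  */
        if (e->installed)
            continue;                               /* case 4: nothing left to do     */
        install_blocks(e->target, e->data, e->len); /* cases 2-3: replay from journal */
        journal_mark_done(j, e->id);
    }
}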
NOTE on journaling buffer
The buffer may fill up with data entries and commits before any of the recorded data has actually been installed to cell memory. In this case, the OS must prioritize installing data to cell memory before any further write commands can be journaled; otherwise, critical information would be lost.
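As a rough sketch of that policy (with hypothetical bookkeeping helpers), the OS can check for free space before journaling a new write and, if the circular buffer is full, install the oldest committed entries first so their journal space can be reclaimed.

/* Sketch only: the journal bookkeeping functions are hypothetical. */
void journal_reserve_space(struct journal *j, size_t needed)
{
    while (journal_free_space(j) < needed) {
        struct journal_entry *oldest = journal_oldest(j);
        if (!oldest->installed)                    /* install before discarding */
            install_blocks(oldest->target, oldest->data, oldest->len);
        journal_reclaim(j, oldest);                /* advance the circular buffer's tail */
    }
}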
NOTE on accessing the Journal History on SEASnet
On the SEASnet machines you can use the following command:
$cd ~/.snapshot
to access a directory of clones of your user directory taken at various times in the past. It appears that SEASnet keeps snapshots of the exact state of our user directory each hour, day, and week. We are able to retrieve each past hour for the previous 8 hours, each past day for the previous 7 days, and each past week for the previous 2 weeks!
Performance concerns with Journaling
The major bottleneck for "naive journaling" is moving the disk arm. If the journal is kept in one contiguous region on the same disk as the data being changed, the arm may constantly move back and forth between cell memory and the journal, spending most of its time seeking rather than transferring data.
Performance Improvement Ideas
- Have the Journal and data on two separate disks/devices.
This way, the two devices can work in parallel, coordinated through the CPU, while minimizing arm movement: the data disk's arm can stay over the data currently being read or written, while the journal disk's arm can stay at the journal entry currently being written. Also, the journal disk only ever appends, so its arm never has to seek during writes.
- Don't use a physical disk for cell storage whatsoever -> Instead, cache the needed pieces in RAM
- This method is a big win if the file system is small enough to fit entirely into RAM. The journal disk arm would never need to seek since it is only appending entries, and access to the file system data would be extremely fast since it is stored in RAM.
- The major downsides here are space (we are fastest when the FS fits in RAM, but RAM is much smaller than a disk) and reboot speed. The reboot will be slower because we need to reconstruct the entire file system from the journal, since it was all stored in volatile RAM.
Interaction between disk scheduling (future) and Robustness (present)
Consider the following region of the disk, where we have 4 writes indexed in the order in which they were requested.
As mentioned before, disk arm movements are a major bottleneck of disk performance. The disk device has its own internal scheduler, and as a result, often re-orders writes to reduce the total distance the disk arm has to seek. With the above example, it is plausible the disk scheduler would re-order the writes as follows:
w2 -> w1 -> w4 -> w3
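Since the figure with the four writes' positions is not reproduced here, the block numbers below are made up, but they illustrate how a scheduler that simply sorts pending writes by position (to shorten total seek distance) could end up issuing them in the order shown above.

#include <stdio.h>
#include <stdlib.h>

struct pending_write { const char *name; long block; };

static int by_block(const void *a, const void *b)
{
    const struct pending_write *x = a, *y = b;
    return (x->block > y->block) - (x->block < y->block);
}

int main(void)
{
    /* Hypothetical target blocks; assume the arm starts below all of them. */
    struct pending_write w[] = {
        { "w1", 400 }, { "w2", 300 }, { "w3", 900 }, { "w4", 700 },
    };
    qsort(w, 4, sizeof w[0], by_block);    /* reorder by position to cut seek distance */
    for (int i = 0; i < 4; i++)
        printf("%s ", w[i].name);          /* prints: w2 w1 w4 w3 */
    printf("\n");
    return 0;
}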
Problems with Disk Scheduling
Problem
- Suppose w2 is the commit signaling that we have successfully written w1, but we crash right after w2 is written and before w1 reaches the disk. Checking the journal on reboot, we would see w2 and think that w1 was written properly, even though w1's region actually contains incorrect data.
Solution
- We constrain the disk scheduler so it does not re-order our commits. Essentially, we mark w2 as being dependent on w1's completion, so the scheduler is not allowed to re-order writes that must happen in a specific sequence to maintain the integrity of our journaling system. To accomplish this we rely on both the OS and the disk controller. This of course implies that we hope the disk controller isn't buggy in any way, but as we have seen in previous lectures, disks are hard to test and perhaps not as reliable as we hope in some cases.
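One hedged way to picture this constraint in code: the OS issues the data write (w1), waits until it is known to be on disk, and only then issues the commit (w2). The io_token_t type, issue_write(), and wait_for_completion() are hypothetical stand-ins for whatever barrier or flush mechanism the OS and controller actually provide.

/* Sketch only: do not let the commit (w2) pass the data it depends on (w1). */
void commit_after_data(struct journal *j, int req_id,
                       const char *data, size_t len, long target_block)
{
    io_token_t t = issue_write(j, req_id, target_block, data, len);  /* w1: journal the data */
    wait_for_completion(t);        /* ordering barrier: w1 must be durable before w2 */
    journal_commit(j, req_id);     /* w2: the commit record */
}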
Unreliable Processes: Bad Memory References and Our Quest for Virtual Memory
Problem
- E.g. 1
char *p = 0;
printf("%c", *p); // dereferencing a null pointer!
- E.g. 2
char buff[BUFFSIZE];
printf("%c", buff[BUFFSIZE]); // reference is one past the end of the array!
Potential Solutions
- Hire better software developers -> UCLA graduates, of course. But then again, nobody is perfect.
- Use an emulator, such as QEMU.
- Insist on using a language with proper runtime subscript checking, such as Java. But both of these solutions (2 and 3) are slow! They add a lot of runtime cost to our process.
- Add simple bounds-checking to hardware
The hardware would know the base and bounds addresses for the memory being used by a particular process and would check each pointer reference in the process to ensure it is within the proper bounds, trapping otherwise (a sketch appears after this list).
- Comparison is an extra operation to be performed, but it is unlikely that there would be a noticeable slowdown in pointer referencing, due to hardware efficiency
- This approach forces the process to have contiguous memory, which is very inflexible and forces the process to know how large it needs to be when it is created
- Forces referenced code to be position-independent. This means we must use relative (not absolute) jumps. Code can be compiled to be position-independent (e.g. "$gcc -fPIC"), but this of course reduces the optimizations compilers can perform and therefore slows down the process.
- Segmentation
- With segmentation, instead of having a single contiguous region in memory for the process, we can instead break the process into 8 different segments. We can then reference the segments with a 32-bit machine address where the first 3 bits represent the segment number and the last 29 bits represent our offset into the segment (see the sketch after this list).
- With this approach we would be able to use the different segments for different pieces of the process, such as read-only text, read/write data, the stack, etc. We would also be able to size each independently and move them around in RAM without messing up the application.
- Problem: It is still far too expensive an operation if we need to grow a segment. It requires us to reserve a new, larger region of memory and copy all of our data from the current segment to the new one.
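A small sketch of both checks discussed above, with hypothetical types and a hypothetical trap_bad_reference() call: first the base-and-bounds test the hardware would apply to every reference, then the 3-bit segment number / 29-bit offset split of a 32-bit segmented address.

#include <stdint.h>

struct segment { uint32_t base, limit; };        /* hypothetical per-segment record */

/* Base-and-bounds: any reference outside [base, base + limit) traps to the OS. */
uint32_t check_and_translate(const struct segment *seg, uint32_t offset)
{
    if (offset >= seg->limit)
        trap_bad_reference();                    /* hypothetical trap into the OS */
    return seg->base + offset;
}

/* Segmentation: top 3 bits pick one of 8 segments, low 29 bits are the offset. */
uint32_t translate_segmented(const struct segment segtab[8], uint32_t vaddr)
{
    uint32_t segno  = vaddr >> 29;               /* first 3 bits: segment number */
    uint32_t offset = vaddr & 0x1FFFFFFFu;       /* last 29 bits: offset         */
    return check_and_translate(&segtab[segno], offset);
}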
Paging to the Rescue!