Central Concern with File Robustness
What happens to our data if we are in the middle of writing to our storage disk when we get a power failure?
It is important that when our system reboots we don't have scrambled data in the region we were writing, with no idea of whether the write finished or not. Ideally, once our system is back up and running, every write will appear either to have been applied completely or not to have happened at all. We call such changes atomic, meaning they happen fully or don't happen at all. We explore journaling as a means to improve file system robustness by adding atomicity to file writes as well as a history of the state of the file system.
What is the journal?
A circular buffer containing the data we want to write, along with commit records indicating whether those writes have successfully completed.
Journaling Protocol
- Log the data to be updated in the journal before physically storing it in cell memory
- Write a commit log to the journal indicating that our previous entries (the data we want to write) are complete
- Install the changes to the physical cell memory
- Log that the entire process is complete
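To make the four steps concrete, here is a minimal C-style sketch of the protocol. The helpers journal_append(), journal_commit(), install_blocks(), and journal_mark_done() are hypothetical names standing in for the real file system code, not an actual API.

/* Sketch only: the journal helpers below are hypothetical. */
void journaled_write(struct journal *j, int req_id,
                     const char *data, size_t len, long target_block)
{
    journal_append(j, req_id, target_block, data, len);  /* 1. log the new data      */
    journal_commit(j, req_id);                           /* 2. log the commit record */
    install_blocks(target_block, data, len);             /* 3. update cell memory    */
    journal_mark_done(j, req_id);                        /* 4. log that we are done  */
}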
If we lose power in the middle of a write, we will have information in the journal about what we were trying to do and whether or not it was completed. The following table summarizes the result of recovery if we fail at various points in the process.
| Step in Process | State of the Write | Recovery Process |
| --- | --- | --- |
| 1 | Data may be partially logged to the journal; no commit record yet | We won't see the commit from step (2), so we know we need to re-collect the data that was supposed to be written. The data in cell memory will not have changed. |
| 2 | Data and commit record are in the journal; nothing installed in cell memory yet | We will know that the data to be written is all in the journal but has not yet been physically written to cell memory. We can install it and finish the journaling protocol for this write immediately. |
| 3 | Install to cell memory may be partial | Upon reboot we might have garbage in cell memory, but all the data to be written (as confirmed by the commit from step (2)) is stored in the journal and can be installed immediately. |
| 4 | Entire write installed and logged as complete | From the journal we will see that the entire write completed successfully, and there is no more work to do for this write. |
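A matching sketch of the recovery pass implied by the table, again with hypothetical structures and helpers: on reboot we scan the journal, skip uncommitted entries (case 1) and fully installed ones (case 4), and replay anything committed but not yet marked installed (cases 2 and 3).

/* Sketch only: journal_entry, journal_first/next, install_blocks, and
   journal_mark_done are hypothetical. */
void recover_from_journal(struct journal *j)
{
    for (struct journal_entry *e = journal_first(j); e != NULL; e = journal_next(j, e)) {
        if (!e->committed)
            continue;                               /* case 1: cell memory untouched  */
        if (e->installed)
            continue;                               /* case 4: nothing left to do     */
        install_blocks(e->target, e->data, e->len); /* cases 2-3: replay from journal */
        journal_mark_done(j, e->id);
    }
}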
NOTE on journaling buffer
The buffer may fill up with data entries and commits before any of the recorded data has actually been installed to cell memory. In this case, the OS must prioritize installing data to cell memory before any further write commands can be journaled; otherwise, critical information would be lost.
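As a rough sketch of that policy (with hypothetical bookkeeping helpers), the OS can check for free space before journaling a new write and, if the circular buffer is full, install the oldest committed entries first so their journal space can be reclaimed.

/* Sketch only: the journal bookkeeping functions are hypothetical. */
void journal_reserve_space(struct journal *j, size_t needed)
{
    while (journal_free_space(j) < needed) {
        struct journal_entry *oldest = journal_oldest(j);
        if (!oldest->installed)                    /* install before discarding */
            install_blocks(oldest->target, oldest->data, oldest->len);
        journal_reclaim(j, oldest);                /* advance the circular buffer's tail */
    }
}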
NOTE on accessing the Journal History on SEASnet
On the SEASnet machines you can use the following command:
$cd ~/.snapshot
to access a directory of clones of your user directory taken at various times in the past. It appears that SEASnet keeps snapshots of the exact state of our user directory each hour, day, and week. We are able to retrieve each past hour for the previous 8 hours, each past day for the previous 7 days, and each past week for the previous 2 weeks!
Performance concerns with Journaling
The major bottleneck for "naive journaling" is moving the disk arm. If the journal is kept in one contiguous region on the same disk as the data being changed, the arm may constantly move back and forth between cell memory and the journal, spending most of its time seeking rather than transferring data.
Performance Improvement Ideas
- Have the Journal and data on two separate disks/devices.
This way, the two devices can work in parallel, coordinated through the CPU, while minimizing arm movement: the data disk's arm can stay over the data currently being read or written, while the journal disk's arm can stay at the journal entry currently being written. Also, the journal disk only ever appends, so its arm never has to seek during writes.
- Don't use a physical disk for cell storage whatsoever -> Instead, cache the needed pieces in RAM
- This method is a big win if the file system is small enough to fit entirely into RAM. The journal disk arm would never need to seek since it is only appending entries, and access to the file system data would be extremely fast since it is stored in RAM.
- The major downsides here are space (we are fastest when the FS fits in RAM, but RAM is much smaller than a disk) and reboot speed. The reboot will be slower because we need to reconstruct the entire file system from the journal, since it was all stored in volatile RAM.
Interaction between disk scheduling (future) and Robustness (present)
Consider the following region of the disk, where we have 4 writes indexed in the order in which they were requested.
As mentioned before, disk arm movements are a major bottleneck of disk performance. The disk device has its own internal scheduler, and as a result, often re-orders writes to reduce the total distance the disk arm has to seek. With the above example, it is plausible the disk scheduler would re-order the writes as follows:
w2 -> w1 -> w4 -> w3
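Since the figure with the four writes' positions is not reproduced here, the block numbers below are made up, but they illustrate how a scheduler that simply sorts pending writes by position (to shorten total seek distance) could end up issuing them in the order shown above.

#include <stdio.h>
#include <stdlib.h>

struct pending_write { const char *name; long block; };

static int by_block(const void *a, const void *b)
{
    const struct pending_write *x = a, *y = b;
    return (x->block > y->block) - (x->block < y->block);
}

int main(void)
{
    /* Hypothetical target blocks; assume the arm starts below all of them. */
    struct pending_write w[] = {
        { "w1", 400 }, { "w2", 300 }, { "w3", 900 }, { "w4", 700 },
    };
    qsort(w, 4, sizeof w[0], by_block);    /* reorder by position to cut seek distance */
    for (int i = 0; i < 4; i++)
        printf("%s ", w[i].name);          /* prints: w2 w1 w4 w3 */
    printf("\n");
    return 0;
}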
Problems with Disk Scheduling
Problem
- Suppose w2 is the commit signaling that we have successfully written w1, but we crash right after w2 is written and before w1 reaches the disk. Checking the journal on reboot, we would see w2 and think that w1 was written properly, even though w1's region actually contains incorrect data.
Solution
- We constrain the disk scheduler so it does not re-order our commits. Essentially, we mark w2 as being dependent on w1's completion, so the scheduler is not allowed to re-order writes that must happen in a specific sequence to maintain the integrity of our journaling system. To accomplish this we rely on both the OS and the disk controller. This of course implies that we hope the disk controller isn't buggy in any way, but as we have seen in previous lectures, disks are hard to test and perhaps not as reliable as we hope in some cases.
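One hedged way to picture this constraint in code: the OS issues the data write (w1), waits until it is known to be on disk, and only then issues the commit (w2). The io_token_t type, issue_write(), and wait_for_completion() are hypothetical stand-ins for whatever barrier or flush mechanism the OS and controller actually provide.

/* Sketch only: do not let the commit (w2) pass the data it depends on (w1). */
void commit_after_data(struct journal *j, int req_id,
                       const char *data, size_t len, long target_block)
{
    io_token_t t = issue_write(j, req_id, target_block, data, len);  /* w1: journal the data */
    wait_for_completion(t);        /* ordering barrier: w1 must be durable before w2 */
    journal_commit(j, req_id);     /* w2: the commit record */
}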
Unreliable Processes: Bad Memory References and Our Quest for Virtual Memory
Problem
- E.g. 1
char *p = 0;
printf("%c", *p); // dereferencing a null pointer!
- E.g. 2
char buff[BUFFSIZE];
printf("%c", buff[BUFFSIZE]); // reference is one past the end of the array!
Potential Solutions
- Hire better software developers -> UCLA graduates, of course. But then again, nobody is perfect.
- Use an emulator, such as QEMU.
- Insist on using a language with proper runtime subscript checking, such as Java. But both of these solutions (2 and 3) are slow! They add a lot of runtime cost to our process.
- Add simple bounds-checking to hardware
The hardware would know the base and bounds addresses for the memory being used by a particular process and would check each pointer reference in the process to ensure it is within the proper bounds, trapping otherwise (a sketch appears after this list).
- Comparison is an extra operation to be performed, but it is unlikely that there would be a noticeable slowdown in pointer referencing, due to hardware efficiency
- This approach forces the process to have contiguous memory, which is very inflexible and forces the process to know how large it needs to be when it is created
- Forces referenced code to be position-independent. This means we must use relative (not absolute) jumps. Code can be compiled to be position-independent (e.g. "$gcc -fPIC"), but this of course reduces the optimizations compilers can perform and therefore slows down the process.
- Segmentation
- With segmentation, instead of having a single contiguous region in memory for the process, we can instead break the process into 8 different segments. We can then reference the segments with a 32-bit machine address where the first 3 bits represent the segment number and the last 29 bits represent our offset into the segment (see the sketch after this list).
- With this approach we would be able to use the different segments for different pieces of the process, such as read-only text, read/write data, the stack, etc. We would also be able to size each independently and move them around in RAM without messing up the application.
- Problem: It is still far too expensive an operation if we need to grow a segment. It requires us to reserve a new, larger region of memory and copy all of our data from the current segment to the new one.
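A small sketch of both checks discussed above, with hypothetical types and a hypothetical trap_bad_reference() call: first the base-and-bounds test the hardware would apply to every reference, then the 3-bit segment number / 29-bit offset split of a 32-bit segmented address.

#include <stdint.h>

struct segment { uint32_t base, limit; };        /* hypothetical per-segment record */

/* Base-and-bounds: any reference outside [base, base + limit) traps to the OS. */
uint32_t check_and_translate(const struct segment *seg, uint32_t offset)
{
    if (offset >= seg->limit)
        trap_bad_reference();                    /* hypothetical trap into the OS */
    return seg->base + offset;
}

/* Segmentation: top 3 bits pick one of 8 segments, low 29 bits are the offset. */
uint32_t translate_segmented(const struct segment segtab[8], uint32_t vaddr)
{
    uint32_t segno  = vaddr >> 29;               /* first 3 bits: segment number */
    uint32_t offset = vaddr & 0x1FFFFFFFu;       /* last 29 bits: offset         */
    return check_and_translate(&segtab[segno], offset);
}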
Paging to the Rescue!