Lecture 14: File system robustness

Notes by Matthew Dempsey, Vincent La Rocca, and Justin Lin

Contents

Improving file system robustness

1. Commit Records

We discussed using commit records to improve file system robustness last lecture. Storage in the file system is broken down into small cells, and all changes to these cells are atomic. If the system crashes while we are changing the contents of a cell from A to B, the cell will afterwards contain either A or B, but never garbage that is a mixture of A and B.

Fig.1 - Cell storage: A cell is being changed from A to B.

2. Journaling

A journal is a buffer that stores a record of all the changes that we make to the file system. After the changes recorded in the journal have been completely performed on cell memory, those journal entries can be reclaimed. Typically, the older entries are no longer useful because those updates have already been made to cell memory. The journal can be thought of as an infinitely long buffer, but it is usually implemented by using a cyclic buffer that overwrites irrelevant journal entries.

Fig.2 - Journal: Garbage at the beginning of the journal and an example entry of writing B into the cell storage above.
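As a sketch of the cyclic-buffer idea (not code from the lecture; all names and sizes are illustrative), the journal can be modeled as a fixed array with head and tail indices, where the head marks the oldest entry not yet applied to cell storage:

```c
/* Toy cyclic journal: entries are appended at the tail and
   reclaimed from the head once applied to cell storage.
   The size, field names, and int-valued entries are all
   illustrative assumptions. */
enum { JOURNAL_SLOTS = 8 };

struct journal {
    int entries[JOURNAL_SLOTS]; /* each entry: one pending update */
    int head;                   /* oldest unapplied entry */
    int tail;                   /* next free slot */
    int count;                  /* entries currently in use */
};

int journal_append(struct journal *j, int update) {
    if (j->count == JOURNAL_SLOTS)
        return -1;              /* full: must apply old entries first */
    j->entries[j->tail] = update;
    j->tail = (j->tail + 1) % JOURNAL_SLOTS;  /* wrap around */
    j->count++;
    return 0;
}

/* Reclaim the oldest entry after its update reaches cell storage. */
int journal_reclaim(struct journal *j) {
    int update = j->entries[j->head];
    j->head = (j->head + 1) % JOURNAL_SLOTS;  /* wrap around */
    j->count--;
    return update;
}
```

The modulo arithmetic is what turns the conceptually infinite buffer into a finite one: old slots are simply overwritten once their updates are safely in cell storage.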

3. Logging

Logging combines the journal and commit records to further improve file system robustness. Say that it takes 4 cell changes to move an example file. We want to move this file atomically, so we can use a commit record in the journal.

Basic Logging Protocol 1:

Fig.3 - Basic Logging Protocol 1: Shows the journal with a begin, all changes, a commit, and an end. Also shows the 4 necessary changes in cell memory.
  1. Log update in journal before changing cell memory
  2. Place a commit in the journal
  3. Apply changes to cell memory
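The three steps above can be modeled in a toy C sketch (the variable names, the fixed four-cell update, and the single commit flag are illustrative assumptions, not from the lecture):

```c
/* Sketch of Protocol 1: log new values, commit, then apply. */
enum { NCELLS = 4 };
int cells[NCELLS];          /* "cell memory" */
int journal_new[NCELLS];    /* journal: logged new values */
int committed = 0;          /* commit record */

void protocol1_update(const int newvals[NCELLS]) {
    /* 1. Log every new value in the journal first. */
    for (int i = 0; i < NCELLS; i++)
        journal_new[i] = newvals[i];
    /* 2. Place the commit record: the update is now atomic. */
    committed = 1;
    /* 3. Apply the logged changes to cell memory. */
    for (int i = 0; i < NCELLS; i++)
        cells[i] = newvals[i];
}

/* Crash recovery: if committed, replay the journal forward so
   cell memory ends up with the new values. */
void protocol1_recover(void) {
    if (committed)
        for (int i = 0; i < NCELLS; i++)
            cells[i] = journal_new[i];
}
```

A crash before the commit record leaves cell memory untouched; a crash after it is repaired by replaying the journal.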

Basic Logging Protocol 2:

  1. Save the old values into the journal
  2. Write the new values into cell memory
  3. Place a commit in the journal
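Protocol 2 can be sketched the same way (again with illustrative names); the key difference is that the journal holds the old values, so recovery undoes an uncommitted update rather than redoing a committed one:

```c
/* Sketch of Protocol 2: save old values, write new, then commit. */
enum { NCELLS2 = 4 };
int cells2[NCELLS2];        /* "cell memory" */
int journal_old[NCELLS2];   /* journal: saved old values */
int committed2 = 0;         /* commit record */

void protocol2_update(const int newvals[NCELLS2]) {
    /* 1. Save the old cell values into the journal. */
    for (int i = 0; i < NCELLS2; i++)
        journal_old[i] = cells2[i];
    committed2 = 0;
    /* 2. Write the new values into cell memory. */
    for (int i = 0; i < NCELLS2; i++)
        cells2[i] = newvals[i];
    /* 3. Place the commit record. */
    committed2 = 1;
}

/* Crash recovery: without a commit, undo with the old values. */
void protocol2_recover(void) {
    if (!committed2)
        for (int i = 0; i < NCELLS2; i++)
            cells2[i] = journal_old[i];
}
```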

Protocol 1 vs. Protocol 2:

If the new value has been stored into the journal but not cell memory, then we can immediately place the newer journal version into cell memory with Protocol 1. Protocol 2 would require an extra read of cell memory.

Fig.4 - Protocol 1 vs. Protocol 2: where to start crash recovery scan

Protocol 2 has a faster recovery because we start scanning the journal from the current time and move backwards. On the other hand, Protocol 1 requires scanning forward in time, which can be difficult and/or inefficient since we must first determine where to begin scanning the journal.

Performance improvements for log-structured file systems

  1. Place journal on a separate disk to avoid seeking: This approach avoids the cost of moving a single disk arm back and forth between the journal and cell memory regions. Since the journal only requires append operations, a separate device dedicated to it rarely needs to seek.
  2. Cache cell storage in RAM as needed: Note that this is simply an extreme version of suggestion 1. If the file system fits in RAM, this approach avoids all seeks entirely; the only cost is an expensive recovery phase after a reboot, when cell storage must be rebuilt from the journal.

Interaction between performance and robustness

Disk scheduling and robustness

Fig.5 - An original order of writes to disk and the writes reordered by disk scheduler.

In the image above, the original write order is deemed inefficient by the disk scheduler because there is unnecessary movement of the disk arm. However, it is possible that write Wa was a commit for write Wb. In that case it would be an error to write the commit before actually writing the data, yet that is exactly the order the disk scheduler decided was optimal. This reordering will interfere with our crash recovery techniques, so we need to place some constraints on the disk scheduler.

Fig.6 - Dependency graph image using the previous writes.

The best way to constrain the disk scheduler so that it does not interfere with the crash recovery is to create a dependency graph. In the image above the arrows represent a dependency. The disk scheduler is informed that it cannot write Wa before writing Wb. Also, it cannot write Wd before writing Wa. Other than these constraints, the disk scheduler is free to reorder the writes to optimize as much as possible.
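The constraint check can be sketched in C (a hypothetical model; the depends_on table, the write numbering, and the function name are made up, with the two dependencies taken from the discussion above):

```c
/* Writes Wa=0, Wb=1, Wc=2, Wd=3.
   depends_on[i][j] = 1 means write i must be issued after write j:
   Wa depends on Wb, and Wd depends on Wa, per the dependency graph. */
enum { NW = 4 };
int depends_on[NW][NW] = {
    [0] = { [1] = 1 },    /* cannot write Wa before Wb */
    [3] = { [0] = 1 },    /* cannot write Wd before Wa */
};

/* Return 1 if a proposed write order violates no dependency. */
int order_is_valid(const int order[NW]) {
    int pos[NW];
    for (int i = 0; i < NW; i++)
        pos[order[i]] = i;            /* position of each write */
    for (int i = 0; i < NW; i++)
        for (int j = 0; j < NW; j++)
            if (depends_on[i][j] && pos[i] < pos[j])
                return 0;             /* i scheduled before j */
    return 1;
}
```

Any order the scheduler picks must pass this check; beyond that, it is free to reorder for arm movement.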

Unreliable processes with bad memory references

Code Segment 1: Examples of bad memory references

Both of these examples reference memory that the user should not be accessing. The first example is attempting to dereference a null pointer, which is never a valid memory address. The second example is attempting to access an index of the array that is out of bounds.
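Since Code Segment 1 is not reproduced here, the two bad references described above might look like the following reconstruction; the checked_read helper at the end is our own illustrative addition, showing manually the kind of runtime check Java performs automatically:

```c
/* Reconstruction of the two bad memory references described above.
   Calling either function would make the hardware trap; they are
   shown for illustration, not meant to be executed. */
int deref_null(void) {
    int *p = 0;             /* null pointer: never a valid address */
    return *p;              /* bad reference #1 */
}

int out_of_bounds(void) {
    int a[10] = {0};
    int i = 10;             /* one past the last valid index */
    return a[i];            /* bad reference #2 */
}

/* A manual version of the subscript and null checks that Java
   inserts at runtime (name and -1 error value are made up). */
int checked_read(const int *a, int len, int i) {
    if (a == 0 || i < 0 || i >= len)
        return -1;          /* Java would throw an exception here */
    return a[i];
}
```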

  1. Hire better programmers. These bad memory references are not a problem if the program never attempts one. However, even the best programmers make mistakes, so we need more assured methods of avoiding this problem:
    1. Use an emulator such as QEMU.
    2. Require subscript and null pointer checking at runtime, as Java does.

Unfortunately, both of these options are still too slow. So we do what we always do when we get stuck: pick up the phone, call our friends at Intel, and ask for an improvement to the hardware.

Simple Hardware Solution: Base and Bound Registers

The simple hardware solution to this problem is to keep a base and bound register pair for each process. The base register points to the start of the memory that the process is allowed to use, and the bound register points to the first spot in memory that is beyond the allowed area for the process.

Fig.7 - Base and bound registers and accessible memory

In other words, for any memory access to location A, base <= A < bound must be true to be considered a valid memory access. If an invalid access is made to the memory, then the hardware traps and the kernel takes over. This allows us to have a limited form of hard modularity.
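The hardware's check amounts to a single comparison per access; in C (with illustrative register values and names):

```c
#include <stdint.h>

/* Illustrative base and bound values for one process. */
uintptr_t base  = 0x1000;
uintptr_t bound = 0x2000;

/* 1 = valid access; 0 = the hardware traps and the kernel
   takes over. Mirrors the condition base <= A < bound. */
int access_ok(uintptr_t a) {
    return base <= a && a < bound;
}
```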

Problems with Base and Bound Registers:

Fig.8 - Sort X and sort Y running as independent processes

Alternative Solution: Segmentation

Fig.9 - Division of machine address into segment number/segment offset

Each machine address used in a pointer is divided so that the upper bits represent a segment number in memory and the lower bits represent the offset within that segment. A process can have multiple segments; in the implementation above, the segment number is the first 3 bits, so each process can have up to 8 segments. Each segment has its own base-bound pair. An application might have a text segment (read-only program code), a data segment (read and write), the stack (read and write), etc.
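Assuming 32-bit addresses (an assumption on our part; the lecture only fixes the 3 segment bits), segmented translation can be sketched as:

```c
#include <stdint.h>

/* Segmented translation sketch: top 3 bits select one of 8
   segments, the remaining 29 bits are the offset. Names and
   the bound-as-size convention are illustrative. */
struct segment { uint32_t base, bound; }; /* bound = segment size */
struct segment segtable[8];               /* one pair per segment */

/* Returns the physical address, or all-ones on a violation
   (where the real hardware would trap). */
uint32_t seg_translate(uint32_t va) {
    uint32_t segno  = va >> 29;           /* upper 3 bits */
    uint32_t offset = va & 0x1FFFFFFF;    /* lower 29 bits */
    if (offset >= segtable[segno].bound)
        return (uint32_t)-1;              /* past the segment: trap */
    return segtable[segno].base + offset;
}
```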

There are still some problems with using segmentation however. To grow a segment, we move other segments aside. This process can become very expensive, especially if a program is working with large amounts of data. To solve these issues with segmentation, we use the concept of paging.

Alternative to Segmentation: Paging

Paging is still similar to segmentation because each memory address represents a page number and an offset within that page. Now we pick a single, constant, relatively small size for all pages. We only need to keep track of the base of each page, because the bound is a fixed amount beyond the base.

Fig.10 - Memory address using a virtual page number and magic box to a physical address

The first part of a memory address represents a virtual page number, which is process specific. The hardware acts as a magic box and transforms the virtual page number into a physical location in memory. In the example implementation, 12 bits are used for the offset, so each page can hold 2^12 bytes of information in memory.

Modeling x86 Simple Page Tables in Software

Page tables are actually handled by the hardware, but we can model them in software to gain a conceptual understanding.

Code Segment 2: Software model of page tables

The page table stores the physical location of every page. In this implementation, each entry is 32 bits and the first 26 bits are the location of the physical page and the remaining 6 are used as flags for information about that page. For example, these flags could indicate read, write, and execute permissions.

In order to make a certain page invalid, we can turn off the rwx flags. This will cause the hardware to trap if the user attempts to access that page. For example, we can turn off the rwx flags on page 0 so that if the user attempts to access a virtual page number of 0, the hardware traps. This protects against dereferencing of NULL pointers.
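The model referred to above as Code Segment 2 might look like the following sketch, using the 26/6 entry split described in the text (the array name, flag layout, and -1 fault value are our own illustrative choices):

```c
#include <stdint.h>

/* One-level page table model: each 32-bit entry holds the
   physical page in its upper 26 bits and six flag bits below,
   with the low three taken here as r, w, and x permissions. */
enum { NPAGES = 1 << 20 };     /* 4 GiB / 2^12-byte pages */
uint32_t page_table[NPAGES];

/* Map a virtual page number to its physical page, or return -1
   where the hardware would trap. */
int64_t pmap(uint32_t vpn) {
    uint32_t pte = page_table[vpn];
    if ((pte & 7) == 0)        /* no r, w, or x permission */
        return -1;             /* invalid page: fault */
    return pte >> 6;           /* strip the flag bits */
}
```

With page_table[0] left zero, virtual page 0 has no permissions, which is exactly the NULL-pointer protection described above.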

If we assume that we have 4 GiB of addressable memory space, each page holds 2^12 bytes, and each page table entry takes 2^2 bytes, then the page table has 2^20 entries occupying 2^22 bytes (4 MiB), so only about 1/1000 of our RAM (precisely 1/1024) is used for the page table.

The %cr3 register tells the hardware where the page table is in memory. This is a privileged register, because otherwise a malicious program could load its own page table into the register and access anywhere in memory. It is also not possible to change the page table directly, because every pointer to memory must go through paging first. A page table does not have a page that contains itself in memory.

Using page tables

mmap is a system call that requests the OS to mess with your own page table in a very controlled way; it lets a process map a file or fresh memory into its address space, for example. A user cannot arbitrarily alter a page table.

If each page table is 4 MiB and each process has its own page table, then page tables can start to take up a more significant amount of memory. To alleviate this problem, page tables can be represented in a sparser style, since frequently there are many zero entries all over the page table. Instead, we can use a two-level page table, where each memory address is broken up into a top-level index, a second-level index, and an offset.

Fig.11 - Large page table and more sparse 2-level page table representation of that table

The following code segment is a software representation of this two level page table.

Code Segment 3: Software implementation of a two level page table.
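Since the segment itself is not reproduced here, a plausible reconstruction follows, splitting a 32-bit address into a 10-bit level-1 index, a 10-bit level-2 index, and the 12-bit offset; OUCH is the fault value the text refers to, and the other names are our own:

```c
#include <stddef.h>
#include <stdint.h>

#define OUCH ((int64_t)-1)     /* fault indicator, per the text */

/* Two-level page table: a null level-1 entry means an entire
   4 MiB range is unmapped and needs no level-2 page at all,
   which is where the space savings come from. */
uint32_t *level1[1 << 10];

int64_t pmap2(uint32_t va) {
    uint32_t hi = va >> 22;             /* level-1 index */
    uint32_t lo = (va >> 12) & 0x3FF;   /* level-2 index */
    uint32_t *l2 = level1[hi];
    if (l2 == NULL)
        return OUCH;                    /* unmapped range: fault */
    uint32_t pte = l2[lo];
    if ((pte & 7) == 0)                 /* same 26/6 split as before */
        return OUCH;                    /* no permissions: fault */
    return pte >> 6;                    /* physical page number */
}
```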

The technical term for returning OUCH is a page fault. In a page fault, the hardware traps, almost like the INT instruction. Control switches to the kernel, and the kernel can either terminate the program or send the signal SIGSEGV to your program.

In response, good practice is for the user's SIGSEGV handler to print the error and exit in a simple way, or perhaps to allocate more memory.

When using a large number of pages, RAM holds as many pages as it can fit, and the remaining pages are stored on disk, so the RAM acts as a cache for the disk. This works in principle, but it becomes very slow if swapping occurs frequently. This is known as thrashing.

Fig.12 - RAM used as a cache for many pages stored on disk