Suppose we are writing a file to disk. However, in the middle of this write, a power surge occurs and our entire system crashes. The disk would then contain an incorrect version of the file. Ideally we would like to be able to recover from problems like this by restoring the file system to its original state, before the write ever occured. How can we implement this?
There are two ideas that will help us with this: commit records and journals.
A commit can be viewed as "the single, low-level write operation that matters." It is the operation that determines, from the system's point of view, whether or not a file was written. It is very important that the commit is an atomic operation, since it has to determine without a doubt whether or not the file was written successfully.
We can keep track of commits in the form of commit records. These records contain the details of the operations that are associated with the commit. When we discuss journals in the next section, these commit records will be used to restore a file system to its original state, before a crash.
Commit records are best illustrated with an example. The process of updating an existing file might go something like this:
A. Write blocks to a DIFFERENT section of disk (a copy area)
B. Wait for blocks to actually hit disk
C. Write a commit record (tells the system where to find the file and how to restore it)
D. Copy blocks from the copy area to the original area
E. Wait for blocks to hit disk
F. Erase the commit record
Note that the commit record is written only AFTER the blocks actually hit the disk. Up until step C, the file is still in its original location, with its original content (from the system's point of view). Once the commit record is actually written, the system will recognize that the file's contents have been changed. Note also that once step C is complete, the file is NOT in its original location.
The way commit records are used in this example ensures that if the system crashes at some point during steps A or B, nothing changes. Since we write to a DIFFERENT section of disk, the original file remains intact even if the system crashes. If we successfully write the commit record, it means that steps A and B were executed normally, so the new file has been written to the copy area. Because of the way we have constructed this process, the file is either update or not. There are no scenarios where a file could be partially written to, so a file can't become corrupted.
After the file has been updated, all that remains to be done is move the file back to its original location. Similarly to the first few steps, we must wait for the blocks to hit the disk. Once this completes, the file has been update and we can erase the commit record. The file is now updated, and resides in its original location on disk.
The distinction between steps A and B is very important. Often when a write is requested, the disk controller will report that the data has been written, even though it may still be residing in a cache. Hence we need to do some low-level disk checking to confirm that the data has actually been written.
A general outline for this process is as follows:
BEGIN
Pre-commit phase (we can easily back out during this time)
COMMIT (or ABORT, if pre-commit phase failed)
Post-commit phase (typically done for efficiency)
END
A journal is a way of organizing a file system to make commit records. It consists of two sections: cell data (the copy area), and the commit records that specify proposed changes to the file system.
There are a couple key advantages to organizing a file system this way:
1. Writes can be contiguous during bursts of activity
2. The journal can be on a separate device
With regards to the second benefit: journals are often kept on SSDs, which allows extremely fast performance in creating the journal, making it seem to the user that the drive is working at very high speed. This is because when the journal entry is committed, the disk controller can report that the entire write is completed because all the data necessary to do so can be recovered. This allows the user program to continue as if the data had actually been written, while the actual disk write is done at a time when the disk isn't as busy.
There are two different logging styles we can use in file system journaling: write-ahead logs and write-behind logs.
In write-ahead logs, all the potential changes to the file system are logged before they are actually carried out.
Write-ahead logs:
1. Log all the writes you plan to do
2. Write the commit record
3. Install the new values into cell data
Write-behind logs:
1. Write old values into journal
2. Install new values into cell data
3. Write the commit record
Write-behind logging is certainly easier to implement. The recovery procedure is surpringly simple: walk through the journal in reverse, carrying out the instructions outlined in each commit record.
Suppose we ran grep on weensy OS. When grep reads a line, it stores it in a single buffer. Weensy OS does not check to see if the buffer can fit into a process's stack. If the buffer size is larger than the remaining stack space allocated for a weensy OS process, the buffer will overwrite a neighboring process's memory. In extreme cases, the buffer may overwrite kernel code, causing the kernel itself to crash! We need some method of dictating what sections of memory a process is allowed to access.
Some proposed solutions to this problem:
1. Hire better programmers - this approach is too expensive, and in the long term, not very feasible
2. Runtime checks in compiler generated code - memory references occur FAR too often for this option to work. Memory references have to be fast, and if we checked each reference in software, it would slow down a program immensely.
3. Base/bounds registers - have a set of priveleged registers that determine the span of memory that a process is permitted to access. It can be implemented in hardware, and processes are prevented from setting their own bounds. This seems like a reasonable place to start.
The idea here is that each process has a base register (the starting address of permissable memory accesses) and a bounds register (the ending address). When a process references memory that does not lie in this region, a trap is executed and the kernel takes over.
There are some problems with this approach:
1. Each process would probably have to be preallocated a fixed amount of memory. Growing these memory regions would be possible, just awkward and most likely prone to bugs.
2. Fragmentation. Processes tend to allocate more memory than they actually need, so if there are already many processes running, chances are that it will be very difficult to find a contiguous memory region that a new process can use.
3. Processes can't share memory! We could allow for the base/bounds pairs to overlap by a certain amount in order to create a shared memory region, but this only works for a pair of processes. It would be impossible for three or more processes to share memory.
One possible fix is to have multiple base/bounds pairs. We could have one for text (read only) and one for data (read/write). This way, processes wouldn't have to maintain their own copies of read-only data. Instead they could all share the same text region.
Note that under this approach, processes must be told exactly where to run. Only absolute jumps (jumps to a fixed address) would be possible. This is inefficient: we would like to allow for relative jumps (specified with an offset from an address), to allow the process more flexibility. This brings us to segmentation.
In segmentation, we divide memory into varying-sized chunks, called (surprise!) segments. Furthermore, instead of a direct address to physical memory, we can divide an address into two sections: a segment number, and an offset.
Suppose we are on an x86-64 machine. We can use the first 16 bits to specify the segment number, and the remaining 48 bits to specify the offset within that segment. When looking up a memory reference, the segment number is looked up first, in a segment table. This table maps the segment number to a physical address (the actual start of the segment in physical memory), and then the offset is added to this to compute the complete physical address. Each segment has its own base/bounds pair, so if the offset exceeds the bounds, the kernel will seize control.
Most of our problems are solved with this approach, though fragmentation still remains an issue because segments can grow very large in contiguous memory. We need a solution that solves fragmentation, but still allows processes to acquire more memory.
Paging is similar to segmentation, though it has some key advantages. Processes have access to a virtual address space, rather than physical memory. Each virtual address is composed of a page number, and an offset within that page. The virtual address is then mapped to physical memory, where the actual data is stored. In contrast to segments, pages are fixed in size.
This mapping occurs via a page table which is an array of physical addresses indexed by page number. Suppose we are on a 32-bit x86 machine. In a virtual address, the first 20 bits will correspond to the page number. The hardware then uses this page number to look up a pointer to a page in physical memory, where the actual page is located.
Then the process of looking up a virtual address is as follows:
1. Look up the page table entry corresponding to the page number. This gives the physical address of the start of the page.
2. Add the offset to the start of the page to compute the complete physical address.
Unfortunately the simplest scheme for virtual addresses, with a single page number and single offset, conflicts with another necessary aspect of pages, that they be kept small enough to disregard internal fragmentation. This is due to the fact we want to keep each page table relatively small and uniform size. In order to satisfy both requirements, it is common to use multi-level page tables, which use indirection so a master page table can have entries pointing to second level page tables, which themselves have entries pointing to even another level of page tables or the data itself. Of course, for a scheme like this the virtual address must change to incorporate the offset in the master page table, the offset in the secondary or tertiary page tables, and the offset in the page itself of the desired data.
An example of a software implementation can further illuminate how 2-level page mapping works. In this example, our virtual address has 10 bits for the master table index (hi), another 10 bits for the intermediate table index (lo), and 12 bits for the offset (offset0). We have to check for FAULT cases at each step; these will be discussed shortly.
size_t pmap(size_t va) {
int offset0 = va & (1<<12) - 1;
int lo = (va >> 12) & (1 << 10);
int hi = va >> 22;
size_t *l0page = PAGETABLE[hi];
if (l0page == FAULT)
return FAULT;
size_t *l1page = l0page[lo];
if (l1page == FAULT)
return FAULT;
return *l1page + offset0;
}
Of course, using this code for every memory reference would be extremely slow, so we'd want this functionality built into the hardware. However, if a process just wants to KNOW a physical address without actually reading its contents, then this function would be suitable.
Since our entire address space is virtual, processes don't need to have a contiguous memory space anymore! This problem was the main drawback to our segmentation approach.
Let's consider what happens when a process references a page that isn't in memory. This is called a page fault. In the code above, this is exactly what we are checking for at each step. When a page fault occurs, we trap into kernel mode. But how should we handle a page fault? We have three options:
1. Terminate the process
2. Send a segfault signal, SIGSEGV (the kernel then arranges for an RTI to a signal handler)
3. Arrange for the reference to become valid, by loading the page into RAM, then modifying the page table to reflect this change
If we can arrange for (3) to happen, we can allow a process to use more memory than RAM actually provides! Extra pages can be kept on disk. When a process references those pages, a page fault will occur. The kernel will then swap the requested page into RAM, modify the page table, and return to the failed instruction.
The role of RAM can then be described as a cache for disk. This works best when there is strong locality of reference, since in this case we will consistently be using the same pages, meaning there is no need to swap pages from disk (which is slow).
However, situations can arise when page faults occur repeatedly. This is called thrashing. It occurs when there is little locality of reference, so processes frequently require different pages. The kernel must then repeatedly read/write to disk as it swaps pages, and this can take an enormous toll on performance.