Notes by Matthew Dempsey, Vincent La Rocca, and Justin Lin
We discussed using commit records to improve file system robustness last lecture. Storage in the file system is broken down into small cells. All changes to these cells will be atomic. If the system crashes as we are changing the contents of a cell from A to B, then the cell will either contain A or B, but never garbage that is a mixture of A and B.
A journal is a buffer that stores a record of all the changes that we make to the file system. After the changes recorded in the journal have been completely performed on cell memory, those journal entries can be reclaimed. Typically, the older entries are no longer useful because those updates have already been made to cell memory. The journal can be thought of as an infinitely long buffer, but it is usually implemented by using a cyclic buffer that overwrites irrelevant journal entries.
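The cyclic-buffer idea can be sketched in a few lines of C. The size and field names here are illustrative, not taken from lecture: head is where the next entry is appended, and tail is the oldest entry whose update may not yet have reached cell memory.

#include <stddef.h>

#define JOURNAL_ENTRIES 1024            /* illustrative size */

struct journal_entry { int cell; int new_value; };

static struct journal_entry journal[JOURNAL_ENTRIES];
static size_t head, tail;               /* indices advance modulo JOURNAL_ENTRIES */

/* Append a record of a planned change; the buffer wraps around, reusing
   slots that journal_reclaim() has already freed.  A real implementation
   would have to wait if head caught up with tail. */
void journal_append(int cell, int new_value)
{
    journal[head].cell = cell;
    journal[head].new_value = new_value;
    head = (head + 1) % JOURNAL_ENTRIES;
}

/* Called once the oldest entry's change has been applied to cell memory:
   that entry is no longer useful, so its slot can be overwritten. */
void journal_reclaim(void)
{
    if (tail != head)
        tail = (tail + 1) % JOURNAL_ENTRIES;
}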
Logging combines the journal and commit records to further improve file system robustness. Say that it takes 4 cell changes to move an example file. We want to move this file atomically, so we can use a commit record in the journal.
Basic Logging Protocol 1 (write-ahead logging):
1. Write the new values of the cells into the journal.
2. Once those journal entries are safely on disk, write a commit record.
3. Copy the new values from the journal into cell memory.
4. Clear the commit record once the copies are complete.
Basic Logging Protocol 2 (write-behind logging):
1. Write the old values of the cells into the journal.
2. Write the new values directly into cell memory.
3. Once all the cell writes are done, write a commit record.
If we crash before the commit record is written, recovery copies the saved old values from the journal back into cell memory, undoing the partial change.
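In code, the two protocols differ only in what gets journaled and in the order of the steps. This is a sketch with made-up helper names; each helper stands in for one or more of the atomic writes described above.

#include <stdio.h>

/* Hypothetical primitives standing in for atomic journal and cell writes. */
static void journal_write_new_values(void) { puts("journal: new values"); }
static void journal_write_old_values(void) { puts("journal: old values"); }
static void journal_commit(void)           { puts("journal: COMMIT");     }
static void install_into_cell_memory(void) { puts("cells: install");      }

/* Protocol 1 (write-ahead): log the new values, commit, then install. */
void move_file_protocol1(void)
{
    journal_write_new_values();
    journal_commit();
    install_into_cell_memory();
}

/* Protocol 2 (write-behind): log the old values, install, then commit. */
void move_file_protocol2(void)
{
    journal_write_old_values();
    install_into_cell_memory();
    journal_commit();
}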
Protocol 1 vs. Protocol 2:
If the new value has been stored into the journal but not cell memory, then we can immediately place the newer journal version into cell memory with Protocol 1. Protocol 2 would require an extra read of cell memory.
Protocol 2 has a faster recovery because we can start scanning the journal from the present and move backwards in time. Protocol 1, on the other hand, requires scanning forward in time, which can be difficult and/or inefficient since we must first determine where in the journal to begin scanning.
In the image above, the original write order is deemed inefficient by the disk scheduler because it involves unnecessary movement of the disk arm. However, it is possible that write Wa is a commit record for write Wb. In that case it would be an error to write the commit before the data it commits, which is exactly what the scheduler's "optimal" order would do. Such reordering interferes with our crash recovery techniques, so we need to place some constraints on the disk scheduler.
The best way to constrain the disk scheduler so that it does not interfere with crash recovery is to create a dependency graph. In the image above, the arrows represent dependencies: the disk scheduler is told that it cannot write Wa before writing Wb, and that it cannot write Wd before writing Wa. Other than these constraints, the disk scheduler is free to reorder the writes to optimize as much as it can.
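One way to picture the interface, with illustrative names and sizes: each pending write carries a list of writes that must reach disk before it, and the scheduler may only issue a write whose dependencies have all completed.

#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPS 8                         /* illustrative limit */

struct pending_write {
    int sector;                            /* where this write goes */
    struct pending_write *deps[MAX_DEPS];  /* writes that must complete first */
    size_t ndeps;
    bool done;
};

/* The scheduler may issue this write now only if every dependency is done;
   beyond that constraint it can reorder writes however it likes. */
bool can_issue(const struct pending_write *w)
{
    for (size_t i = 0; i < w->ndeps; i++)
        if (!w->deps[i]->done)
            return false;
    return true;
}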
Both of these examples reference memory that the user should not be accessing. The first example is attempting to dereference a null pointer, which is never a valid memory address. The second example is attempting to access an index of the array that is out of bounds.
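The two examples themselves are not reproduced in these notes; they were presumably something along these lines.

/* Example 1: dereferencing a null pointer, which is never a valid address. */
void null_dereference(void)
{
    int *p = 0;
    *p = 27;
}

/* Example 2: indexing far past the end of an array. */
void out_of_bounds(void)
{
    char buf[10];
    buf[1000] = 'x';
}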
Unfortunately both of these options are still too slow. So we do what we always do when we get stuck, which is pick up the phone and call our friends at Intel and ask for an improvement to the hardware.
The simple hardware solution to this problem is to keep a base and bounds register for each process. The base register points to the start of the memory that the process is allowed to use, and the bound register points to the first spot in memory that is beyond the allowed area for the process.
In other words, for any memory access to location A, base <= A < bound must be true to be considered a valid memory access. If an invalid access is made to the memory, then the hardware traps and the kernel takes over. This allows us to have a limited form of hard modularity.
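Conceptually, the hardware performs something like the following check on every access; the function is only a model, not real hardware code.

#include <stdbool.h>
#include <stdint.h>

/* An access to address a is valid only if base <= a < bound.
   If this returns false, the hardware traps and the kernel takes over. */
bool access_ok(uintptr_t a, uintptr_t base, uintptr_t bound)
{
    return base <= a && a < bound;
}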
Problems with Base and Bound Registers:
- Each process gets one contiguous chunk of memory, so growing a process's memory usually means moving the whole process.
- Every part of the process has the same permissions; we cannot make the program text read-only while keeping the data writable.
- Sharing memory between processes is awkward, since each process has only a single base-bounds pair.
Each machine address used in a pointer is divided so that the upper bits represent a segment number and the lower bits represent the offset within that segment. A process can have multiple segments; in the implementation above the segment number is the top 3 bits, so each process can have up to 8 segments. Each segment has its own base-bounds pair. An application might have a text segment (read-only program code), a data segment (read and write), the stack (read and write), and so on.
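A rough software model of the translation, assuming 32-bit addresses with the top 3 bits selecting the segment; the table names and the trap stand-in are made up.

#include <stdint.h>

static uint32_t SEG_BASE[8];    /* per-process base for each of the 8 segments */
static uint32_t SEG_BOUND[8];   /* first address beyond each segment's area    */

uint32_t seg_translate(uint32_t va)
{
    uint32_t seg = va >> 29;            /* top 3 bits: segment number         */
    uint32_t off = va & 0x1FFFFFFF;     /* low 29 bits: offset within segment */
    uint32_t phys = SEG_BASE[seg] + off;
    if (phys >= SEG_BOUND[seg])
        return (uint32_t)-1;            /* stand-in for a hardware trap       */
    return phys;
}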
There are still some problems with segmentation, however. To grow a segment, we may have to move other segments aside, which can become very expensive, especially if a program is working with large amounts of data. To solve these issues, we use the concept of paging.
Paging is still similar to segmentation because each memory address represents a page number and an offset within that page. Now we pick a single, constant, relatively small size for all pages. We only need to keep track of the base of each page, because the bound is a fixed amount beyond the base.
The first part of a memory address represents a virtual page number, which is process specific. The hardware acts as a magic box that transforms the virtual page number into a physical location in memory. In the example implementation, 12 bits are used for the offset, so each page can hold 2^12 bytes of information.
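With a 12-bit offset and 32-bit addresses, the split looks like this (again a model, not the real hardware):

#include <stdint.h>

uint32_t vpn(uint32_t va)    { return va >> 12;   }   /* upper 20 bits: virtual page number */
uint32_t offset(uint32_t va) { return va & 0xFFF; }   /* lower 12 bits: offset within page  */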
Page tables are actually walked by the hardware, but we can model them in software to gain a conceptual understanding.
The page table stores the physical location of every page. In this implementation, each entry is 32 bits: the upper 26 bits give the location of the physical page and the remaining 6 bits are flags holding information about that page. For example, these flags can indicate read, write, and execute permissions.
In order to make a certain page invalid, we can turn off the rwx flags. This will cause the hardware to trap if the user attempts to access that page. For example, we can turn off the rwx flags on page 0 so that if the user attempts to access a virtual page number of 0, the hardware traps. This protects against dereferencing of NULL pointers.
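Modeling the entry layout described above in C (the exact flag bit positions are assumptions; the point is that a page with no rwx permission traps on any access):

#include <stdint.h>

#define PTE_R 0x1                        /* assumed flag bit positions */
#define PTE_W 0x2
#define PTE_X 0x4

static uint32_t page_table[1 << 20];     /* one 32-bit entry per virtual page */

uint32_t pte_phys_page(uint32_t e) { return e >> 6;   }  /* upper 26 bits */
uint32_t pte_flags(uint32_t e)     { return e & 0x3F; }  /* lower 6 bits  */

/* Clear rwx on virtual page 0: any later access to page 0 makes the
   hardware trap, which catches null-pointer dereferences. */
void protect_null_page(void)
{
    page_table[0] &= ~(uint32_t)(PTE_R | PTE_W | PTE_X);
}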
If we assume 4 GiB (2^32 bytes) of addressable memory space, pages of 2^12 bytes, and 2^2 bytes per page table entry, then the page table has 2^32 / 2^12 = 2^20 entries and takes up 2^20 × 2^2 = 4 MiB. Since 4 MiB / 4 GiB = 1/1024, only about 1/1000 of our RAM is used for the page table.
The %cr3 register tells the hardware where the page table is in memory. It is a privileged register; otherwise a malicious program could load its own page table into the register and access anywhere in memory. A user program also cannot change the page table directly, because every pointer it uses must go through paging first, and the page table does not map a page containing the page table itself.
mmap is a system call that asks the OS to modify your own page table in a very controlled way; it allows a user, for example, to map new pages or clone existing ones in the address space. A user cannot arbitrarily alter a page table.
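A minimal use of mmap, asking the OS to map one new page of anonymous, readable and writable memory into the process:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;                   /* one page, matching the 2^12-byte page size */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    p[0] = 'a';                          /* the newly mapped page is ordinary memory */
    munmap(p, len);
    return 0;
}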
If each page table is 4 MiB and each process has its own page table, these tables can start to take up a significant amount of memory. To alleviate this problem, page tables can be represented more sparsely. Frequently, large stretches of a page table are all zeros. Instead, we can use a two-level page table, where each memory address is broken up into a top-level table index, a second-level table index, and an offset within the page.
The following code segment is a software representation of this two-level page table.
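The code itself is not reproduced in these notes; here is a minimal sketch of what it might look like, assuming 32-bit virtual addresses split into two 10-bit table indices plus a 12-bit offset, and a made-up OUCH value returned when a lookup fails.

#include <stdint.h>
#include <stddef.h>

#define OUCH ((uint32_t)-1)              /* made-up value meaning "lookup failed" */

static uint32_t *PAGE_TABLE[1 << 10];    /* top level: pointers to second-level tables */

/* Return the page table entry for virtual address va, or OUCH on failure. */
uint32_t pmap(uint32_t va)
{
    uint32_t hi = va >> 22;              /* top 10 bits: index into the top level     */
    uint32_t lo = (va >> 12) & 0x3FF;    /* next 10 bits: index into the second level */

    uint32_t *level2 = PAGE_TABLE[hi];
    if (level2 == NULL)
        return OUCH;                     /* no second-level table for this region */

    uint32_t entry = level2[lo];
    if ((entry & 0x7) == 0)              /* rwx flags all clear: page is invalid  */
        return OUCH;
    return entry;
}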
The technical term for returning OUCH is a page fault. In a page fault, the hardware traps, almost like the INT instruction. Control switches to the kernel, and the kernel can either terminate the program or send the signal SIGSEGV to your program.
In response, good practice is for the user program's SIGSEGV handler either to print an error message and exit cleanly, or perhaps to map in more memory and continue.
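A minimal handler along those lines, using sigaction; the faulting access at the end is there only to exercise the handler.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Keep the handler simple: report the fault and exit.  write() and _exit()
   are used because a signal handler should stick to async-signal-safe calls. */
static void segv_handler(int sig)
{
    (void)sig;
    static const char msg[] = "caught SIGSEGV, exiting\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(1);
}

int main(void)
{
    struct sigaction sa = {0};
    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) {
        perror("sigaction");
        return 1;
    }

    int *p = NULL;                       /* deliberately trigger a page fault */
    return *p;
}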
When a program uses a large number of pages, RAM holds as many of them as it can fit and the remaining pages are stored on disk; the RAM acts as a cache for the disk. This works in principle, but it becomes very slow if swapping occurs frequently. This is known as thrashing.