CS 111 Lecture 14 Scribe Notes: Virtual Memory
by Alan Covarrubias
Table of Contents
Atomic Writes to Disk
Virtual Memory
Atomic Writes to Disk
GZIP Bug
Before Bug Fix
fd = open("foo.gz", O_WRONLY|O_CREAT, 0644);
write(fd, buf, nbytes);   // compress foo into foo.gz
close(fd);
unlink("foo");
After Bug Fix
dfd = open(".", O_SEARCH);
fd = openat(dfd, "foo.gz", O_CREAT|…);
write(fd, buf, nbytes);
fsync(fd);       // saves all the information associated with this file to disk, including data and metadata
fdatasync(dfd);  // saves only the directory's data to disk, not its metadata
close(fd);
unlinkat(dfd, "foo", 0);
Bug Explanation
There is a bug in the program gzip that can cause files to be lost, because write() puts data in the kernel's cache rather than on disk. Since the cache is volatile, if the system crashes after gzip unlinks "foo" but before the cached data for "foo.gz" reaches disk, all the data can be lost! The same bug can occur in any program that replaces one file with another, such as mv. The bug fix using fsync and fdatasync makes gzip considerably slower, but it makes the program more reliable by forcing the writes out to disk instead of leaving them in the cache. This creates a predicament: we want both speed and reliability, but we cannot have both. If you want reliability, there are two main approaches you can take:
- Commit records
- Journaling
Commit Records
Assumptions for Writing Individual Sectors
- Individual sectors can be written atomically.
- If the power goes out before writing to disk, it is as if the write never happened.
- If the power goes out during a write to disk, you have enough power to finish.
W1 | W2 | ... | WN | Commit Record |
With a commit record, we can bundle several atomic actions into one large atomic action.
The first N writes are tentative until the commit record is written last.
If the commit record was never written, then all N previous writes are ignored.
The key idea is that we take multiple small atomic writes and build them into one large atomic write using the commit record, much as we built mutexes and critical sections out of small atomic primitives like spin locks.
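As a rough illustration, here is a minimal sketch of the idea in C, modeling the disk as an in-memory array of sectors. The names disk, COMMIT_SECTOR, and MAGIC are made up for this example, and each memcpy stands in for one atomic sector write.

#include <stdio.h>
#include <string.h>

#define SECTOR_SIZE   512
#define NSECTORS      8
#define COMMIT_SECTOR (NSECTORS - 1)   // last sector holds the commit record
#define MAGIC         "COMMITTED"

static char disk[NSECTORS][SECTOR_SIZE];   // hypothetical stand-in for the device

// Write sectors 0..n-1 tentatively, then write the commit record last.
void atomic_multiwrite(char data[][SECTOR_SIZE], int n)
{
    for (int i = 0; i < n; i++)
        memcpy(disk[i], data[i], SECTOR_SIZE);         // tentative writes W1..WN
    memcpy(disk[COMMIT_SECTOR], MAGIC, sizeof MAGIC);  // the one small atomic write
}

// After a crash, the N writes count only if the commit record made it out.
int committed(void)
{
    return memcmp(disk[COMMIT_SECTOR], MAGIC, sizeof MAGIC) == 0;
}

int main(void)
{
    char data[2][SECTOR_SIZE] = { "W1", "W2" };
    atomic_multiwrite(data, 2);
    printf("committed? %s\n", committed() ? "yes" : "no");
    return 0;
}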
Journaling
We divide our file system into two pieces: cell memory and a log.
Log:         | BEGIN | a:A | b:B | COMMIT | END |
Cell memory: | a | b |
We perform a write by first logging the intended new values (a:A means "write value A into cell a") and a commit record, and only then writing to cell memory.
Won’t we still have a problem? What if the system crashes before the commit is written?
After a crash, the CPU examines the log and checks for inconsistencies with cell memory. If the log is inconsistent with cell memory, the logged writes to cell memory are repeated until the END record can be written. The END record in the log signifies consistency between cell memory and the log.
What happens when we run out of log?
In practice, when you build a file system, it asks how big you want the log to be. The log is treated as a circular buffer: when it fills up, writing wraps around to the start, reusing entries whose changes have already been installed in cell memory.
Suppose you crash in the middle of recovery
Recovery should be robust in the presence of crashes, so recovery must be idempotent: recover(recover(x)) = recover(x).
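A tiny sketch of why this style of recovery is idempotent, using hypothetical types: it is a pure overwrite of cells from the log, so running it twice leaves the same state as running it once.

struct log_entry { int cell; int new_value; };  // hypothetical log layout

void recover(int cells[], const struct log_entry log[], int n, int committed)
{
    if (!committed)
        return;                                 // no commit record: ignore the log
    for (int i = 0; i < n; i++)
        cells[log[i].cell] = log[i].new_value;  // pure overwrite: safe to repeat after a crash
}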
Step by Step Atomic Implementation
- BEGIN
- Log intended changes (pre-commit)
- Save enough details so that if the system crashes, we remember what the CPU was about to write to disk. Store a COMMIT record in the log when this phase finishes successfully; otherwise store an ABORT record. A crash during this phase leaves cell memory in the same state it was in before the write started.
- COMMIT/ABORT
- Copy changes from log to cell data (post-commit)
- If a crash occurs in this phase, we can recover by using the data in the log. On reboot, the CPU checks for inconsistencies between cell memory and the log. If a commit was written, the CPU uses the log to fix any inconsistencies (see the sketch after this list).
- END
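The sequence above might be sketched in C as follows; the record layout is hypothetical, and in-memory arrays stand in for the on-disk log and cell memory.

enum rec_type { REC_BEGIN, REC_CHANGE, REC_COMMIT, REC_ABORT, REC_END };

struct record {                // hypothetical log record layout
    enum rec_type type;
    int cell;                  // REC_CHANGE only: which cell to write
    int value;                 // REC_CHANGE only: the value to install
};

void atomic_update(struct record log[], int cells[],
                   const struct record changes[], int n)
{
    int k = 0;
    log[k++] = (struct record){ REC_BEGIN, 0, 0 };
    for (int i = 0; i < n; i++)                     // pre-commit: log intended changes
        log[k++] = changes[i];
    log[k++] = (struct record){ REC_COMMIT, 0, 0 };
    for (int i = 0; i < n; i++)                     // post-commit: install into cell memory
        cells[changes[i].cell] = changes[i].value;
    log[k++] = (struct record){ REC_END, 0, 0 };
}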
Logging Protocols
There are two main types of logging protocols that support atomic writes to disk.
Write-Ahead Protocol
With write-ahead, the CPU first stores the values it plans to write in the log. Once finished, it stores a commit record in the log and begins writing the data to cell memory. If the system crashes before the commit, the write is lost and cell memory stays consistent with the old data. If the system crashes after the commit, then on recovery the CPU scans the log forward, looking for committed writes that have not yet been installed in cell memory.
- Steps:
- Log planned writes
- Log commit when successful, otherwise log abort
- Write data to cell memory
- Pros/Cons:
- + More likely to preserve the most recent action, since the new values reach the log before anything else.
- + No need to know the old values.
- - Slower recovery when a crash occurs after the commit is written.
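A hedged sketch of write-ahead recovery under these steps, scanning the log left to right; the entry layout is hypothetical.

struct wa_entry { int is_commit; int cell; int new_value; };  // hypothetical layout

void wa_recover(int cells[], const struct wa_entry log[], int n)
{
    int start = 0;                       // first entry of the current transaction
    for (int i = 0; i < n; i++) {
        if (log[i].is_commit) {          // committed: (re)install its changes
            for (int j = start; j < i; j++)
                cells[log[j].cell] = log[j].new_value;
            start = i + 1;
        }
    }
    // entries after the last commit never committed, so they are ignored
    // and cell memory keeps the old values
}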
Write-Behind Protocol
With write-behind, the CPU stores the old values in the log before writing the new data to cell memory. The CPU writes a commit record to the log only when the write to cell memory is complete. If the system crashes before any new data reaches cell memory, recovery is simple, since only the log was changed; cell memory is left in the old state it was in before the write started. If a crash occurs after some new data has been written to disk, the CPU must use the logged old values to return cell memory to the old state.
- Steps:
- Log old cell values
- Write new values to cell memory
- Commit
- Pros/Cons:
- + Faster recovery: we scan the log from right to left only until the most recent END record and then stop, instead of replaying the whole log.
- - The old data must be logged even though it is usually not needed; the logged values are only used if a crash occurs during the writes to cell memory.
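A hedged sketch of write-behind recovery, scanning right to left; the entry layout is hypothetical.

struct wb_entry { int is_commit; int cell; int old_value; };  // hypothetical layout

void wb_recover(int cells[], const struct wb_entry log[], int n)
{
    for (int i = n - 1; i >= 0; i--) {
        if (log[i].is_commit)
            break;                              // everything before the last COMMIT is durable
        cells[log[i].cell] = log[i].old_value;  // undo the unfinished write
    }
}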
Main Memory Database
A rare type of database that stores all of its data in RAM instead of writing it to cell memory on disk. Since RAM is volatile, a crash loses everything in memory, so every change to the database is also appended to a log on disk. On reboot, the contents of RAM must be reconstructed by replaying the log writes stored on disk.
Pros/Cons:
- + Access to cells is very fast because the data is stored in RAM.
- + No seeks when writing, since every write appends an entry to the end of the log.
- - Recovery is slow, because the entire database must be rebuilt in RAM from the log.
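A minimal sketch of the reboot path, with a hypothetical log entry layout: the whole database is rebuilt by replaying the on-disk log in order.

struct mm_entry { int key; int value; };  // hypothetical log entry

void mm_rebuild(int ram[], const struct mm_entry log[], int n)
{
    for (int i = 0; i < n; i++)          // replay every logged update, in order
        ram[log[i].key] = log[i].value;
    // this is why recovery is slow: the whole database is rebuilt from the log
}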
Higher Order Operations
Sometimes a high-level operation performs its work by calling lower-level operations, and a lower-level operation can fail. There are two main options for how these operations can interact: cascading aborts and compensating actions. Cascading aborts are easier to automate, but compensating actions are more flexible.
Cascading Aborts
- If a higher-level process performs an atomic action by calling a lower-level process, and the lower-level process returns an error and aborts, the higher-level process also aborts. Operations can be nested several levels deep, and an abort in an inner operation aborts every enclosing operation; thus the aborts cascade.
- Example: Write-Ahead Protocol Cascading Abort:
- While logging planned writes, error occurs
- Abort record
- Send cascading aborts to higher level functions.
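A minimal sketch of how the abort propagates, with three hypothetical nested levels:

int low_level(void)
{
    return -1;         // an error occurs while logging planned writes
}

int mid_level(void)
{
    if (low_level() < 0)
        return -1;     // log an abort record, then propagate the abort upward
    return 0;
}

int high_level(void)
{
    if (mid_level() < 0)
        return -1;     // the abort cascades to the outermost level
    return 0;
}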
Compensating Actions
- When a high-level operation calls a lower-level operation that fails, the higher-level operation figures out something else to do to make up for the failure. Compensating actions tend to be used after a commit has been written, when aborting is no longer an option.
- Example: Write-Ahead Protocol Compensating Action:
- Log planned writes
- Commit record
- While writing to disk, an error occurs. After reboot, the compensating action finishes the writes to cell memory using the data recorded in the log.
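A minimal sketch of that compensating action, with a hypothetical log layout: the transaction already committed, so recovery finishes the install instead of aborting.

struct entry { int cell; int new_value; };  // hypothetical log layout

// The transaction already committed, so instead of aborting, the
// compensating action redoes the interrupted post-commit copy.
void compensate(int cells[], const struct entry log[], int n)
{
    for (int i = 0; i < n; i++)
        cells[log[i].cell] = log[i].new_value;
}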
Mount Tables
Can we partition our data into 10% “must be persistent” vs 90% “okay if we lose it”?
Suppose you have an application that needs to be reliable and another one that doesn’t need to be. Or suppose you have a mixture of data, some that require reliability and some that don’t.
Linux allows this by introducing a data structure called a mount table.
This maps a file system reference and an inode to the file system that contains the data.
(reference, inode#) => | file_system |
(dev_t, inode#)     => | dev_t |
Concretely, the file system reference is a dev_t: the pair identifying the mount point maps to the dev_t of the file system mounted there.
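A hedged sketch of such a lookup; the structure layout is hypothetical, not Linux's actual implementation.

#include <sys/types.h>
#include <stddef.h>

struct file_system;          // opaque handle for a mounted file system

struct mount_entry {         // hypothetical layout
    dev_t dev;               // device containing the mount point
    ino_t ino;               // inode number of the mount point
    struct file_system *fs;  // the file system mounted there
};

struct file_system *mount_lookup(const struct mount_entry table[], size_t n,
                                 dev_t dev, ino_t ino)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].dev == dev && table[i].ino == ino)
            return table[i].fs;   // cross into the mounted file system
    return NULL;                  // not a mount point
}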
Virtual Memory
We've examined how to protect data from power outages using methods that allow atomic writes to disk. Now we'll examine a different source of unreliability that can cause problems: unreliable programs that make bad memory references.
Possible Solutions:
- Hire better programmers (impractical, because we all make mistakes)
- Get help from our hardware friends (fast, but the extra hardware is expensive)
- Get help from the compiler and/or runtime (use programming languages that check memory references; slower than hardware support)
A Naive Approach
Kernel Code | Process 1 | Process 3 | Process 2 | Kernel Code |
When any process runs, two registers hold the base and bounds of the memory allocated to that process. Every memory reference the process makes is checked to see whether it lies within those bounds:
base <= address < bound
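A minimal sketch of the check, with a hypothetical process structure; in reality the hardware performs this on every reference and traps to the kernel on failure.

#include <stdbool.h>
#include <stdint.h>

struct proc { uintptr_t base, bound; };  // hypothetical; loaded into registers on a context switch

bool access_ok(const struct proc *p, uintptr_t addr)
{
    return p->base <= addr && addr < p->bound;  // hardware traps to the kernel on failure
}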
This works well in batch environments, but it does not work in dynamic environments. Although you can copy data around to free up space when memory grows or shrinks, copying is expensive. It also provides no simple way to share data between processes. It would be useful if segments did not have to be contiguous in physical memory.
Solution:
Instead of a single base and bounds pair, we can maintain multiple register pairs for each process, with each pair marking the boundaries of one region of memory, such as the text, the data, and the process's stack. Additionally, we can dedicate one pair to a shared memory region, which solves the sharing problem of the naive approach.
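A hedged sketch extending the check to several pairs; the segment names are hypothetical.

#include <stdbool.h>
#include <stdint.h>

enum { SEG_TEXT, SEG_DATA, SEG_STACK, SEG_SHARED, NSEGS };  // hypothetical segments

struct segment { uintptr_t base, bound; };

bool seg_access_ok(const struct segment segs[NSEGS], uintptr_t addr)
{
    for (int i = 0; i < NSEGS; i++)  // a reference is valid if it falls in any segment
        if (segs[i].base <= addr && addr < segs[i].bound)
            return true;
    return false;
}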
Paging
A more efficient way of isolating memory per process is to give each process its own page map. A page map is an array of page map entries, where each entry translates a small fixed-size range of contiguous virtual addresses, called a page, to a range of physical addresses called a block. Each entry maps a page number to a block number, and the byte offset within the page is carried through unchanged to produce the exact physical address of the memory the program wants to dereference. The translation is implemented by virtual memory support embedded in the hardware. If a process's page map contains no entry for an address, that process cannot reference that piece of memory.
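A hedged sketch of the translation, assuming hypothetical 4 KiB pages and a 1024-entry page map:

#include <stdint.h>

#define PAGE_BITS 12                     // hypothetical 4 KiB pages
#define PAGE_SIZE (1u << PAGE_BITS)
#define NPAGES    1024                   // hypothetical page map size

static uint32_t page_map[NPAGES];        // page number -> block number

uint32_t translate(uint32_t va)          // bounds check on the page number omitted
{
    uint32_t page   = va >> PAGE_BITS;       // index into the page map
    uint32_t offset = va & (PAGE_SIZE - 1);  // byte offset, carried through unchanged
    uint32_t block  = page_map[page];        // the entry gives the physical block
    return (block << PAGE_BITS) | offset;    // the exact physical address
}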
Advantages of Paging
- If a referenced page is not in RAM, the hardware raises a page fault and the kernel loads the page from disk. This only works efficiently if we assume the program does not access memory randomly.
- Programs can share memory “safely”. You can implement interprocess communication efficiently and safely if different page table entries map to the same physical block.
- We don’t have to worry about fragmentation. The indirection allows contiguous virtual addresses to be discontiguous in physical memory. This allows the dynamic growing and shrinking of virtual address spaces to be efficient.