Scribe: Matthew Nelson
- Sections:
- Robustness
- Commit records
- Journal
- Peeking inside transactions
- Virtual memory
Two ideas for surviving power failures (and other robustness challenges)
- Idea #1: a commit record
e.g.
- write blocks from RAM to copy area
- wait for blocks to hit disk
- write a separate commit record
This says where to find the file
- copy from the copy area to the original area
File contents changed, location moved
- clear commit record
File contents new, location unchanged
General pattern
- Pre-commit phase
during this time, you can easily back out with no change
- Commit or abort
- If you commit, the data is written to the original area
- If you abort, the commit record is removed
- Post-commit phase or post-abort phase
done for efficiency, typically
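The commit-record sequence above can be sketched in a few lines of C. This is a minimal in-memory model, not a real file system: the "disk" is just a pair of arrays, and the names (begin_update, finish_update, commit_record) are invented for illustration. The key property is that commit_record flips in a single step, so a crash at any point leaves either the old contents (record absent) or recoverable new contents (record present).

```c
#include <assert.h>
#include <string.h>

enum { BLOCK = 4, NBLOCKS = 2 };

static char orig[NBLOCKS][BLOCK];   /* original area */
static char copy[NBLOCKS][BLOCK];   /* copy area */
static int commit_record;           /* 0 = absent, 1 = present */

void begin_update(char newdata[NBLOCKS][BLOCK]) {
    /* Pre-commit: write the new data to the copy area.  A crash here
       is harmless -- the original area is untouched. */
    memcpy(copy, newdata, sizeof copy);
    /* (a real system would wait for these blocks to hit disk here) */
    commit_record = 1;              /* the atomic commit point */
}

void finish_update(void) {
    /* Post-commit: copy into the original area, then clear the record.
       Replaying this after a crash is idempotent. */
    if (commit_record) {
        memcpy(orig, copy, sizeof orig);
        commit_record = 0;
    }
}
```

Note that calling finish_update again after a crash mid-copy just redoes the copy, which is why the repair is safe to repeat.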
- Idea #2: a journal
2 areas
- Cell data
blocks of data in the file system to be updated one at a time
- Journal
Goes on “forever”
- bounded – wraps around to the beginning
garbage collection maintains the list
List of proposed changes to cell data
- for example, write a commit record
Advantages:
- Writes can be contiguous bursts of activity
This requires that the journal is in a contiguous piece of disk
- Journal can be located on a different device
e.g. have the journal on an SSD
- There is no need to put cell data on disk at all
Cell data can be kept in RAM
- writes to journal must happen, but writes to data can be delayed
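A sketch of the bounded, wrap-around journal described above, with invented names and sizes: records are appended at head, and the index wraps to slot 0 when the journal fills (in a real system, garbage collection must reclaim old slots before they are overwritten).

```c
#include <assert.h>

enum { JSLOTS = 8 };                 /* journal capacity (made up) */

struct jrec { int cell; int newval; };   /* "change cell to newval" */

static struct jrec journal[JSLOTS];
static int head;                     /* next free slot */

void journal_append(int cell, int newval) {
    journal[head].cell = cell;
    journal[head].newval = newval;
    head = (head + 1) % JSLOTS;      /* bounded: wrap to the beginning */
}
```

Because appends always go to consecutive slots, writes to the journal are the contiguous bursts mentioned above.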
Two different types of logs
- Write-ahead logs
- log all writes you plan to do
- commit
- install new values into cell data
- must be replayed left-to-right
- Write-behind log
- log all the old values into the journal
- install new values into cell data
- commit
- must be replayed backwards
- Recovery tends to be faster, since the records it needs are near the end of the journal
- Repairs must be idempotent
You have to be able to repair even after a failed repair due to another crash, etc.
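The two replay directions can be contrasted with a small sketch over an in-memory log (all names invented). A write-ahead log stores new values and is replayed oldest-first, so the latest new value wins; a write-behind log stores old values and is replayed newest-first, so the earliest saved value wins. Both replays are idempotent: running either one twice leaves the cells unchanged, which is exactly the property repairs need.

```c
#include <assert.h>

enum { NCELLS = 4 };
struct rec { int cell; int val; };

static int cells[NCELLS];            /* the "cell data" */

/* write-ahead: log holds NEW values; replay left to right */
void replay_ahead(const struct rec *log, int n) {
    for (int i = 0; i < n; i++)
        cells[log[i].cell] = log[i].val;
}

/* write-behind: log holds OLD values; replay right to left */
void replay_behind(const struct rec *log, int n) {
    for (int i = n - 1; i >= 0; i--)
        cells[log[i].cell] = log[i].val;
}
```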
A more complicated approach to robustness: Let apps “look inside” pending transactions
- Pros:
- Better parallelism/performance
- Lower latency
- Cons:
- Need a way to notify apps that a transaction they peeked inside was aborted
e.g. a signal
- Cascading aborts
If process D depends on process C, which depends on B, which depends on A,
then if A's write is reported as failed after D wrote to disk, B, C, and D must all be aborted too
- Compensating actions
- Process A writes
- Process B peeks inside A’s write, writes, commits
- Process A aborts
- Process B already committed and must remedy the problem
Virtual memory
- Problem
We have unreliable programs that create bad memory references
- e.g. GNU grep 2.10 on 64-bit hosts
if run on a bare machine, whole system can fail
- Solution
- Hire better programmers!
This solution has been tried – it sometimes works, but it costs $$$
- Use run-time checks in compiler-generated code
Works, but slower
- Base & bounds register pair
- First register points to first spot in RAM that can be used
- Second register points to the last spot in RAM that can be used
- Hardware checks that all memory references fall in this range
- If you use an invalid spot in RAM, hardware traps, kernel boots you out
- Problems
- You need to preallocate all RAM
- You can grow memory, but it’s awkward
- Preallocation limits how many processes can fit in RAM and run efficiently
- Memory gets fragmented
Both internal and external
- Applications could change their own base & bounds registers
Fix: make the registers privileged, so only the kernel can write them
- No memory-sharing
Two processes can share by overlapping their regions, but sharing among three or more doesn't work out
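The base & bounds check itself is tiny; here is a sketch with the two registers simulated as variables (addresses and names are made up). The hardware applies this test to every memory reference, and a failing check becomes a trap into the kernel.

```c
#include <assert.h>
#include <stddef.h>

static size_t base_reg  = 0x10000;   /* first usable address */
static size_t bound_reg = 0x20000;   /* one past the last usable address */

/* returns 1 if the access is allowed, 0 if the "hardware" would trap */
int check_access(size_t addr) {
    return base_reg <= addr && addr < bound_reg;
}
```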
- Multiple base & bound pairs
- for text (read-only)
- for data (read-write)
- Problems
- Position-dependent
Pretend we’re running cat:
text starts at 0x30000, up to 0x40000
Now when we compile cat, we must tell it where it will run
- Solution: design the program so it doesn’t care where it runs
- relative jumps, not absolute
- Position-independent code (PIC)
- Alternate solution: you can dynamically link or patch
- Fragmentation
Can’t grow/shrink segments easily
- Solution:
use fixed-size pages
e.g. 4KiB
- Virtual address
20-bit page number
- Refers to the page table
Page #0 to 2^20-1
- Each entry in the page table, in turn, points to a page of physical memory
These are also 4KiB in size
These are represented in the page table by 20 bits of page number and 12 bits of flags
size of page table is bounded: 2^20 entries of 4 bytes each = 4 MiB
12-bit offset
- This offset is carried over directly as the offset in the physical page
- This requires hardware assistance
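The 20/12 split above is just bit manipulation; a sketch for 32-bit virtual addresses and 4 KiB pages:

```c
#include <assert.h>
#include <stdint.h>

/* top 20 bits index the page table */
uint32_t page_number(uint32_t va) { return va >> 12; }

/* bottom 12 bits pass through unchanged as the offset in the page */
uint32_t page_offset(uint32_t va) { return va & 0xFFF; }
```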
- Common situation: VM for one process
forbidden | text | forbidden | data -> | forbidden | <- stack
(data grows upward; stack grows downward)
- forbidden areas are large compared to all data
- We can make the page table smaller by having a 2-level page table
page size = 4KiB
master table points to intermediate tables
intermediate tables point to actual addresses
only have an intermediate table when the memory it points to is actually being used
virtual address with intermediate page number: 10-bit top-level index | 10-bit intermediate index | 12-bit offset
x86 MMU (memory management unit) can handle this
Software emulation of 2-level page table (or at least an approximation)
size_t pmap(size_t va) {
    size_t offset = va & ((1 << 12) - 1);     /* parens matter: (1<<12)-1 would otherwise parse as 1<<11 */
    size_t lo = (va >> 12) & ((1 << 10) - 1);
    size_t hi = va >> 22;
    size_t *L0page = PAGETABLE[hi];
    if (L0page == FAULT)
        return FAULT;                         /* no intermediate table for this region */
    size_t L1entry = L0page[lo];
    if (L1entry == FAULT)
        return FAULT;                         /* page not present */
    return L1entry + offset;                  /* physical page base + offset */
}
PAGETABLE is actually held in a register (%cr3 on x86)
- Of course, it's privileged so processes can't go messing around with it
When the hardware faults due to a bad PTE (page table entry)
- hardware traps
kernel can:
- terminate process
- send SIGSEGV to process
- kernel arranges for RTI to signal handler
- arrange for invalid access to become valid
- load it into RAM
- i.e. if memory has been swapped off to disk (virtual, after all)
- modify page tables to reflect this
- RTI back to the failing instruction
In essence, this treats RAM as a cache for disk
- works if good locality of reference
however, there's thrashing if random access
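The kernel's choices on a fault can be sketched as a small decision function. The struct, the enum, and the helper are all invented for illustration; the point is the branch structure: if the page is merely swapped out, bring it in, fix the page table, and retry the instruction; otherwise the access is genuinely invalid and the process gets killed or signaled.

```c
#include <assert.h>

enum fault_action { TERMINATE, SIGNAL, SWAP_IN };

/* invented per-page state */
struct pte { int valid; int on_disk; };

enum fault_action handle_fault(struct pte *p, int proc_handles_sigsegv) {
    if (p->on_disk) {
        /* arrange for the invalid access to become valid:
           load the page into RAM, update the page table, RTI to retry */
        p->valid = 1;
        p->on_disk = 0;
        return SWAP_IN;
    }
    /* genuinely bad reference: signal if the process asked for it,
       otherwise terminate */
    return proc_handles_sigsegv ? SIGNAL : TERMINATE;
}
```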