Scribe: Matthew Nelson
- Sections:
- Robustness
- Commit records
- Journal
- Peeking inside transactions
- Virtual memory
Two ideas for surviving power failures (and other robustness challenges)
- Idea #1: a commit record
e.g.
- write blocks from RAM to copy area
- wait for blocks to hit disk
- write a separate commit record
This says where to find the file
- copy from the copy area to the original area
File contents changed, location moved
- clear commit record
File contents new, location unchanged
General pattern
- Pre-commit phase
during this time, you can easily back out with no change
- Commit or abort
- If you commit, the data is written to the original area
- If you abort, the commit record is removed
- Post-commit phase or post-abort phase
done for efficiency, typically
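The commit-record sequence above can be sketched in a few lines of C. This is a minimal in-memory model, not a real file system: the "disk" is just a pair of arrays, and the names (begin_update, finish_update, commit_record) are invented for illustration. The key property is that commit_record flips in a single step, so a crash at any point leaves either the old contents (record absent) or recoverable new contents (record present).

```c
#include <assert.h>
#include <string.h>

enum { BLOCK = 4, NBLOCKS = 2 };

static char orig[NBLOCKS][BLOCK];   /* original area */
static char copy[NBLOCKS][BLOCK];   /* copy area */
static int commit_record;           /* 0 = absent, 1 = present */

void begin_update(char newdata[NBLOCKS][BLOCK]) {
    /* Pre-commit: write the new data to the copy area.  A crash here
       is harmless -- the original area is untouched. */
    memcpy(copy, newdata, sizeof copy);
    /* (a real system would wait for these blocks to hit disk here) */
    commit_record = 1;              /* the atomic commit point */
}

void finish_update(void) {
    /* Post-commit: copy into the original area, then clear the record.
       Replaying this after a crash is idempotent. */
    if (commit_record) {
        memcpy(orig, copy, sizeof orig);
        commit_record = 0;
    }
}
```

Note that calling finish_update again after a crash mid-copy just redoes the copy, which is why the repair is safe to repeat.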
- Idea #2: a journal
2 areas
- Cell data
blocks of data in the file system to be updated one at a time
- Journal
Goes on “forever”
- bounded – wraps around to the beginning
garbage collection maintains the list
List of proposed changes to cell data
- for example, write a commit record
Advantages:
- Writes can be contiguous bursts of activity
This requires that the journal is in a contiguous piece of disk
- Journal can be located on a different device
e.g. have the journal on an SSD
- There is no need to put cell data on disk at all
Cell data can be kept in RAM
- writes to journal must happen, but writes to data can be delayed
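A sketch of the bounded, wrap-around journal described above, with invented names and sizes: records are appended at head, and the index wraps to slot 0 when the journal fills (in a real system, garbage collection must reclaim old slots before they are overwritten).

```c
#include <assert.h>

enum { JSLOTS = 8 };                 /* journal capacity (made up) */

struct jrec { int cell; int newval; };   /* "change cell to newval" */

static struct jrec journal[JSLOTS];
static int head;                     /* next free slot */

void journal_append(int cell, int newval) {
    journal[head].cell = cell;
    journal[head].newval = newval;
    head = (head + 1) % JSLOTS;      /* bounded: wrap to the beginning */
}
```

Because appends always go to consecutive slots, writes to the journal are the contiguous bursts mentioned above.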
Two different types of logs
- Write-ahead logs
- log all writes you plan to do
- commit
- install new values into cell data
- must be replayed left-to-right
- Write-behind log
- log all the old values into the journal
- install new values into cell data
- commit
- must be replayed backwards
- Recovery tends to be faster, since the records it needs are near the end of the journal
- Repairs must be idempotent
You have to be able to repair even after a failed repair due to another crash, etc.
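The two replay directions can be contrasted with a small sketch over an in-memory log (all names invented). A write-ahead log stores new values and is replayed oldest-first, so the latest new value wins; a write-behind log stores old values and is replayed newest-first, so the earliest saved value wins. Both replays are idempotent: running either one twice leaves the cells unchanged, which is exactly the property repairs need.

```c
#include <assert.h>

enum { NCELLS = 4 };
struct rec { int cell; int val; };

static int cells[NCELLS];            /* the "cell data" */

/* write-ahead: log holds NEW values; replay left to right */
void replay_ahead(const struct rec *log, int n) {
    for (int i = 0; i < n; i++)
        cells[log[i].cell] = log[i].val;
}

/* write-behind: log holds OLD values; replay right to left */
void replay_behind(const struct rec *log, int n) {
    for (int i = n - 1; i >= 0; i--)
        cells[log[i].cell] = log[i].val;
}
```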
A more complicated approach to robustness: Let apps “look inside” pending transactions
- Pros:
- Better parallelism/performance
- Lower latency
- Cons:
- Need a way to notify apps that a transaction they peeked inside was aborted
e.g. a signal
- Cascading aborts
If process D depends on process C, which depends on B, which depends on A,
then if A's write is reported as failed after D wrote to disk, B, C, and D must all be aborted too
- Compensating actions
- Process A writes
- Process B peeks inside A’s write, writes, commits
- Process A aborts
- Process B already committed and must remedy the problem
Virtual memory
- Problem
We have unreliable programs that create bad memory references
- e.g. GNU grep 2.10 on 64-bit hosts
if run on a bare machine, whole system can fail
- Solution
- Hire better programmers!
This solution has been tried – it sometimes works, but it costs $$$
- Use run-time checks in compiler-generated code
Works, but slower
- Base & bounds register pair
- First register points to first spot in RAM that can be used
- Second register points to the last spot in RAM that can be used
- Hardware checks that all memory references fall in this range
- If you use an invalid spot in RAM, hardware traps, kernel boots you out
- Problems
- You need to preallocate all RAM
- You can grow memory, but it’s awkward
- Preallocation limits how many processes can fit in RAM and run efficiently
- Memory gets fragmented
Both internal and external
- Applications could change their own base & bounds registers
Fix: make the registers privileged, so only the kernel can write them
- No memory-sharing
Two processes can share by overlapping their regions, but sharing among three or more doesn't work out
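The base & bounds check itself is tiny; here is a sketch with the two registers simulated as variables (addresses and names are made up). The hardware applies this test to every memory reference, and a failing check becomes a trap into the kernel.

```c
#include <assert.h>
#include <stddef.h>

static size_t base_reg  = 0x10000;   /* first usable address */
static size_t bound_reg = 0x20000;   /* one past the last usable address */

/* returns 1 if the access is allowed, 0 if the "hardware" would trap */
int check_access(size_t addr) {
    return base_reg <= addr && addr < bound_reg;
}
```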
- Multiple base & bound pairs
- for text (read-only)
- for data (read-write)
- Problems
- Position-dependent
Pretend we’re running cat:
text starts at 0x30000, up to 0x40000
Now when we compile cat, we must tell it where it will run
- Solution: design the program so it doesn’t care where it runs
- relative jumps, not absolute
- Position-independent code (PIC)
- Alternate solution: you can dynamically link or patch
- Fragmentation
Can’t grow/shrink segments easily
- Solution:
use fixed-size pages
e.g. 4KiB
- Virtual address
20-bit page number
- Refers to the page table
Page #0 to 2^20-1
- Each entry in the page table, in turn, points to a page of physical memory
These are also 4KiB in size
These are represented in the page table by 20 bits of page number and 12 bits of flags
size of page table is bounded: 2^20 entries of 4 bytes each = 4 MiB
12-bit offset
- This offset is carried over directly as the offset in the physical page
- This requires hardware assistance
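The 20/12 split above is just bit manipulation; a sketch for 32-bit virtual addresses and 4 KiB pages:

```c
#include <assert.h>
#include <stdint.h>

/* top 20 bits index the page table */
uint32_t page_number(uint32_t va) { return va >> 12; }

/* bottom 12 bits pass through unchanged as the offset in the page */
uint32_t page_offset(uint32_t va) { return va & 0xFFF; }
```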
- Common situation: VM for one process
forbidden | text | forbidden | data -> | forbidden | <- stack
(data grows upward; stack grows downward)
- forbidden areas are large compared to all data
- We can make the page table smaller by having a 2-level page table
page size = 4KiB
master table points to intermediate tables
intermediate tables point to actual addresses
only have an intermediate table when the memory it points to is actually being used
virtual address with intermediate page number: 10-bit top-level index | 10-bit intermediate index | 12-bit offset
x86 MMU (memory management unit) can handle this
Software emulation of 2-level page table (or at least an approximation)
size_t pmap(size_t va) {
    size_t offset = va & ((1 << 12) - 1);     /* parens matter: (1<<12)-1 would otherwise parse as 1<<11 */
    size_t lo = (va >> 12) & ((1 << 10) - 1);
    size_t hi = va >> 22;
    size_t *L0page = PAGETABLE[hi];
    if (L0page == FAULT)
        return FAULT;                         /* no intermediate table for this region */
    size_t L1entry = L0page[lo];
    if (L1entry == FAULT)
        return FAULT;                         /* page not present */
    return L1entry + offset;                  /* physical page base + offset */
}
PAGETABLE is actually held in a register (%cr3 on x86)
- Of course, it's privileged so processes can't go messing around with it
When the hardware faults due to a bad PTE (page table entry)
- hardware traps
kernel can:
- terminate process
- send SIGSEGV to process
- kernel arranges for RTI to signal handler
- arrange for invalid access to become valid
- load it into RAM
- i.e. if memory has been swapped off to disk (virtual, after all)
- modify page tables to reflect this
- RTI back to the failing instruction
In essence, this treats RAM as a cache for disk
- works if good locality of reference
however, there's thrashing if random access
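The kernel's choices on a fault can be sketched as a small decision function. The struct, the enum, and the helper are all invented for illustration; the point is the branch structure: if the page is merely swapped out, bring it in, fix the page table, and retry the instruction; otherwise the access is genuinely invalid and the process gets killed or signaled.

```c
#include <assert.h>

enum fault_action { TERMINATE, SIGNAL, SWAP_IN };

/* invented per-page state */
struct pte { int valid; int on_disk; };

enum fault_action handle_fault(struct pte *p, int proc_handles_sigsegv) {
    if (p->on_disk) {
        /* arrange for the invalid access to become valid:
           load the page into RAM, update the page table, RTI to retry */
        p->valid = 1;
        p->on_disk = 0;
        return SWAP_IN;
    }
    /* genuinely bad reference: signal if the process asked for it,
       otherwise terminate */
    return proc_handles_sigsegv ? SIGNAL : TERMINATE;
}
```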