CS 111 Lecture 14: Virtual Memory
    March 4, 2013
    by Sidney Chan
    
    Our Next Problem: Unreliable processes with bad memory references
    
      -  subscript errors
 
      - dereferencing bad pointers (null, uninitialized, freed, etc.)
 
    
         We've had these problems since the 1950s. 
     Some possible solutions include: 
    
      - Fire the programmers! (This doesn't scale.) Nobody should be
        writing code like this so don't do it. Unfortunately, these
        problems are hard to solve.
 
      - Add runtime checks to your code such as runtime subscript
        checking, etc. (This is fairly popular; the JVM does subscript
        checking by throwing an exception so you can't reference outside
        your bounds.)
 
    
    Why can't we just do runtime checks?
    
      - Performance issues: We may want to play with pointers for
        performance reasons, and also runtime checking is slow.
 
      - Trust issues: We have to know if a program is written in
        machine code, or if runtime checking was turned off by
        attackers.
 
      - A workaround: If the compiler is part of the OS, we can run
        only programs that the system itself compiled (as in the
        Burroughs B5500 and its successors).
 
    
    We need help from the hardware to do this checking: the hardware
    will check each memory reference.
    Simple approach: 2 extra registers, base and bounds
    
      - base points at the part of
        memory you're allowed to use right now
 
      - bounds points to the byte just after the last one you can
        use
 
    
    
                  
    
    We have a hardware check for each memory access. If you access
    outside your bounds you trap and the kernel can decide what to do.
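The hardware check can be sketched in C (hypothetical names; real hardware performs this in the address path on every access and traps instead of returning a value):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical sketch of the base-bounds check the hardware performs. */
static uintptr_t base_reg;    /* start of the region this process may use */
static uintptr_t bounds_reg;  /* byte just after the last usable byte */

/* Returns true if the access is allowed; real hardware traps on failure. */
bool check_access(uintptr_t addr)
{
    return addr >= base_reg && addr < bounds_reg;
}
```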
    
    Problems with this simple approach:
    
    
      - You must have just 1 region. (Possible solution: extend our
        machine to have multiple base-bounds pairs.)
 
      - Programs must fit into RAM; we can't have virtual memory.
 
    
    Suppose you're running a sort program and you want to run two
    instances at the same time:
                           
    
    
      - ip1 is at address 0x10000, ip2 at 0x12000
 
      - Programs cannot contain absolute addresses: ip2 is a copy of
        the sort code, so it will still jump to 0x10000, which is in
        ip1's region. Relative addresses will work.
 
    
    To fix the absolute-address pointer, you can:
    
      - Change the way address arithmetic works (hardware) to always
        add the base to every address before you use it. This will work.
 
      - Mutate the program as you load it into RAM: the copy on disk
        has addresses computed as if the program were loaded at 0,
        and the loader adjusts them to true absolute addresses as it
        reads the code in (a relocating loader).
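A relocating loader can be sketched as follows. This is a minimal sketch with a hypothetical format: we assume the on-disk image comes with a relocation table listing the offsets of the words that hold addresses.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a relocating loader: the on-disk copy's addresses assume the
 * program is loaded at 0; as we load, we add the actual load address to
 * each word the (hypothetical) relocation table points at. */
void relocate(uint32_t *image, const size_t *reloc_offsets, size_t nrelocs,
              uint32_t load_base)
{
    for (size_t i = 0; i < nrelocs; i++)
        image[reloc_offsets[i]] += load_base;  /* absolute = relative + base */
}
```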
 
    
    
    Suppose the program changes the base or bounds.
    
      - Solution: Have these registers set as privileged either by
        setting special privileged instructions or by making accesses to
        these registers be privileged.
 
    
    Suppose you want processes to share memory.
    If you have only 2 processes you can pull this off, but this won't
    scale to larger numbers.
    
      - Solution: have multiple base-bounds pairs.
 
      - add a system call to create a "domain"
 
    
           
                 
          map_domain([size], [permissions])

       where the permissions might be:

          x   (execute)
          rx  (read-execute)
          w   (we can have this, but it's unusual)
          _   (for kernel use only)
          rwx
          ...
    
    
      - These permissions need not be supported by hardware.
 
    
    Traditional x86 does r, w, but not x.
    
      - One former problem: your stack was readable and writable, and
        since there was no separate execute permission, it was also
        executable, making buffer overflow attacks easy to do. Modern
        x86 can give the stack r and w but not x, so executing
        injected stack code will trap.
 
    
     
    
    Problem: If you have a fixed number of base-bounds pairs, say
    4, this will not be enough domains. There are many possible uses
    for a domain, including: the text of a program (r), the data of a
    program (rw), the heap (rw), stacks (1 per thread), I/O buffers,
    shared memory. We have many more uses than base-bounds pairs.
    Segmentation (segmented memory)
    
    
    We choose a base-bounds pair from a segment table. Privilege is
    needed to change the segment table. The segment number is used to
    identify the segment, along with an offset within the segment. Each
    entry in the table includes information such as its base, bounds,
    permissions, and so on.
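Segment translation can be sketched in C (hypothetical structure and helper names; real hardware does this lookup on every access and traps on violations):

```c
#include <stdint.h>

/* Hypothetical segment-table entry. */
struct segment {
    uint32_t base;    /* physical address where the segment starts */
    uint32_t bounds;  /* segment length in bytes */
    int perms;        /* permission bits, e.g. r/w/x */
};

#define NSEGS 8
struct segment seg_table[NSEGS];

/* Translate (segment number, offset) to a physical address.
 * Returns (uint32_t)-1 where real hardware would trap. */
uint32_t seg_translate(unsigned segno, uint32_t offset)
{
    if (segno >= NSEGS || offset >= seg_table[segno].bounds)
        return (uint32_t)-1;   /* out of range: trap */
    return seg_table[segno].base + offset;
}
```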
    
    
    Problems
    
      - What if you run out of segments?
 
      - Segmentation increases hardware complexity.
 
      - Growing a segment is expensive. This is the killer problem for
        segmented memory.
 
    
    
 
    
      - The kernel should arrange that segments never overlap.

      - More problems: suppose segment A is full and the process then
        allocates more memory with new X() or malloc(N). A must grow,
        so we have to push the neighboring segment C's data somewhere
        else. This is a pain.
 
    
    Page-based Memory Management: the solution for segmented memory
    
      -  Now, we fix all the pages to the same size, say 8 KiB. This
        way, data for a process does not all have to sit next to each
        other in memory, making growing and moving easier. This also
        solves the problem of fragmentation.
 
      - We have a magic box that stores information about the mapping
        between physical and virtual memory.
 
      - Each process has its own page table, with a (privileged) page
        register that points to a process's page table. It is up to the
        OS to determine how to implement a page table.
 
    
    
    
    
    Linear page tables 
    
 
    
      - (in x86, pages are 4 KiB, with 20-bit page numbers + 12-bit
        offsets)
 
    
    With linear page tables, we have a page table that holds multiple
    page table entries (PTE). In our example, the table holds 2^20
    entries of 4 bytes each, meaning that each process has a page table
    of 4 MiB. 
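With 4 KiB pages, a 32-bit virtual address splits into a 20-bit page number and a 12-bit offset; a small sketch of the split and of rebuilding the physical address from a PTE's physical page number (helper names are ours):

```c
#include <stdint.h>

/* Split a 32-bit virtual address into page number and offset (4 KiB pages). */
uint32_t vpn(uint32_t va)    { return va >> 12; }     /* top 20 bits */
uint32_t offset(uint32_t va) { return va & 0xfff; }   /* low 12 bits */

/* Given the physical page number from the PTE, rebuild the physical address. */
uint32_t phys_addr(uint32_t ppn, uint32_t va)
{
    return (ppn << 12) | offset(va);
}
```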
    
    
      - We must write a function pmap that takes as its argument a
        virtual page number and returns the page table entry for that
        page number. This example function is very simple:
 
    
    
           
          int pmap(int v)
          {
              return PAGE_TABLE[v]; // the table base is stored in the
                                    // privileged register %cr3
          }
    
    
      - With linear page tables, for small processes, there will be
        many unused PTES in the page table. Thus, we waste a lot of RAM.
        To solve this, we use 2-level page tables.
 
    
    
    2-level page tables
    
    
    Here, instead of using the full 20 bits for page numbers, we use the
    first 10 bits for the first level page table and the next 10 bits
    for the next level. With two levels, we now have a more complicated
    pmap function.
    
           
          int pmap(int v)
          {
              int hi = v >> 10;              // top 10 bits: 1st-level index
              int lo = v & ((1 << 10) - 1);  // low 10 bits: 2nd-level index
              pt *p = PAGETABLE[hi];
              if (!p)
                  return 0;
              return p[lo] >> 8;
          }
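Continuing the example, the 20-bit virtual page number splits into two 10-bit indices: the top 10 bits index the first-level table and the low 10 bits index the second level (helper names are ours):

```c
#include <stdint.h>

/* Split a 20-bit virtual page number for a 2-level page table:
 * top 10 bits pick the first-level entry, low 10 bits the second. */
unsigned hi_index(unsigned vpn) { return vpn >> 10; }
unsigned lo_index(unsigned vpn) { return vpn & ((1u << 10) - 1); }
```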
    
    
      - There are now more memory accesses, so translation is slower,
        but a small process needs only 3 * 4 * 2^10 bytes = 12 KiB of
        page tables, which is much smaller. However, this won't help
        us for large processes. To deal with those, we may want to
        "lie" to the hardware: put into the page table entries the
        image of a process that is smaller than the process actually
        is.
 
    
    
    
      - It is actually fairly common for a process to think it has
        more memory than it really does (e.g. it sees 3.5 GiB when it
        actually has 100 MiB of RAM). In this scheme, only a portion
        of a process's pages are stored in main memory, with the rest
        on disk (swap space). When the process touches a page that's
        not in RAM, the hardware traps (since the access looks
        invalid), and the kernel retrieves the page from disk and
        fixes up the page table.
 
    
                          
    
 
    
      - Downsides: Very slow! If we page fault a lot it will be as if
        we had no RAM at all. Also, when we place the new page, what
        happens to the useful memory we overwrite? We have to save
        the old data onto disk first, so a fault costs a write + a
        read.
 
    
    Mechanism for page faulting
    When we page fault, we trap and enter the kernel. The kernel
    inspects the faulting address.
    
      void pfault(int v, int access, ...)
      {
          if (swapmap(v, current_process) == FAIL)
              // bad address, e.g. dereferencing a null pointer or a
              // subscript error
              kill(current_process);
          else {  // valid address
              (p, va) = removal_policy();      // choose a victim page
              pa = p->pmap(va);                // pa: victim's physical address
              p->pmap(va) = FAULT;             // the victim's page is gone
              write(pa, swapmap(p, va));       // save the victim to disk
              read(pa, swapmap(current_process, v)); // read the wanted page in
              current_process->pmap(v) = pa;
          }
      }
    
      - In the above code, p is the process holding the victim page
        and va is the virtual address of the victim. removal_policy()
        picks a victim physical page so it can be evicted to disk;
        the victim's owner will run very slowly if it touches that
        page again.
       
    
    What can we do wrong?
    
      - Suppose the code for pfault is held in a page that becomes the
        victim. Memory manager pages themselves must be "wired down":
        never selected as victims.
 
      - Can we make secondary tables victims? Sure, if the hardware
        will allow us to write a 0 in the entry for that table in the
        top level table. You can't make the top level table a victim,
        however.
 
    
    Virtual Memory Tuning
    
      - Demand paging (quite popular): when you start a program, load
        just its first page. The advantage of this is that you start up
        quickly, but if you only load pages just as they're needed you
        can cause the disk arm to move more.
 
      - Dirty bit: in your page table, you keep track (1 bit/page,
        every store insn can change it) of whether the page has been
        stored into since it has been read in. Now, if we need to get
        rid of a page, we know that we don't have to write it back to
        disk if it has not been modified.
 
    
    How can we optimize the dirty bit so we don't have to store extra
    state in RAM?
    
      - We can avoid the necessity of computing the dirty bit on pages
        where we don't have write access.
 
      - We can store this bit in the page table entry since we have
        some extra room.
 
      - Even if the hardware doesn't give you a dirty bit, we can
        tell the hardware that a page has fewer write permissions
        than it really does by clearing the w bit in the PTE. The
        first store traps; the OS sets its own dirty bit, restores
        the w bit, and returns, and the process continues. The OS
        keeps the dirty bits in its own area.
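The write-protect trick above can be sketched in C (hypothetical structures and names; the real bookkeeping lives in the kernel's per-process data):

```c
#include <stdbool.h>

/* Sketch of emulating a dirty bit in software. The OS clears the
 * hardware w bit even for writable pages; the first store traps, and
 * the handler records the page as dirty and restores w. */
#define NPAGES 1024

struct pte { bool writable; };   /* hardware-visible entry (simplified) */
struct pte page_table[NPAGES];
bool os_dirty[NPAGES];           /* the OS's private dirty bits */
bool really_writable[NPAGES];    /* the true permission, kept by the OS */

/* Called on a write-protection fault for virtual page vpn.
 * Returns true if the fault was our emulation, false if it was a
 * genuine protection violation. */
bool wp_fault(unsigned vpn)
{
    if (!really_writable[vpn])
        return false;                 /* genuine violation: e.g. kill */
    os_dirty[vpn] = true;             /* the page is about to be modified */
    page_table[vpn].writable = true;  /* let the store proceed on retry */
    return true;
}
```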