Lecture 15: Virtual memory and processes, distributed systems
Transcribed by Benjamin Statz and Junnan Wang
Page Faults
Virtual memory structure
Page faults occur when a process cannot access a page, whether because it lacks access privileges or because the page is simply not loaded into RAM. When a page fault occurs, control is transferred to the kernel, which has three options:
- Terminate process
- Send SIGSEGV
- Pull page into memory and retry the instruction (slowest, requires I/O)
What can an app programmer do in case 2? For example, upon a SIGSEGV, Emacs assumes a stack overflow has occurred and discards the stack. This is not a reliable way to detect stack overflow. Emacs uses guard pages, which are made inaccessible using mmap and trigger a SIGSEGV when touched - but a stack frame containing an array larger than the guard page can skip over the guard page entirely, leaving the overflow undetected. The solution is simply for Emacs not to create large local arrays.
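A guard page can be created in user space with mmap and PROT_NONE; the following is a minimal sketch of the idea (not Emacs's actual code), in which any access to the protected page delivers a SIGSEGV:

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        long pagesize = sysconf(_SC_PAGESIZE);
        /* Reserve one page that forbids all access; reads and writes fault. */
        char *guard = mmap(NULL, pagesize, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (guard == MAP_FAILED) { perror("mmap"); return 1; }
        guard[0] = 'x';  /* faults here: the kernel delivers SIGSEGV */
        return 0;
    }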
Page fault mechanics
What is stored in kernel memory?
- Page tables
- Swap map
- Given a process and a virtual page number within that process, the swap map gives you the corresponding disk address:
- da = swapmap(p, vpn)
- Each process table entry records the disk location of the process's swap space
- The disk address is calculated as that base disk location + (4096 * vpn)
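Concretely, the swap map can be as simple as the following sketch, where struct proc and the field name swap_base are assumptions rather than real kernel names:

    #include <sys/types.h>

    #define PAGESIZE 4096

    struct proc {
        off_t swap_base;  /* start of this process's swap area on disk */
        /* ... */
    };

    /* da = swapmap(p, vpn): disk address holding virtual page vpn of p */
    off_t swapmap(struct proc *p, unsigned long vpn) {
        return p->swap_base + (off_t)PAGESIZE * vpn;
    }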
On a page fault (assuming a valid virtual address):
- There is insufficient physical memory - page is on disk
- Must read the page in from disk, at the location given by the swap map
- Problem: must read into a location in physical memory
- As the system keeps running, memory fills up - must pick a location that already contains data
- Solution: before reading page, choose victim page (p', vpn')
- Write from physical page to swapmap(p', vpn')
- Update old page table entry to be NULL (no longer in RAM, meaning it should page fault if accessed)
- Update page table entry for new page
- How long does all of this take? On the order of milliseconds, which is very long in CPU terms
- Other threads can be scheduled in the meantime
- The work can also be broken into pieces and scheduled in parallel
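Putting these steps together, the page-in path looks roughly like the following sketch; choose_victim, owner_of, write_page, read_page, and the page table layout are all assumed names, not real kernel interfaces:

    #include <sys/types.h>

    #define INVALID ((unsigned long)-1)  /* sentinel: page not in RAM */

    /* Hypothetical declarations standing in for real kernel structures. */
    struct proc { off_t swap_base; unsigned long *pagetable; };
    unsigned long choose_victim(void);
    void owner_of(unsigned long ppage, struct proc **p, unsigned long *vpn);
    void write_page(unsigned long ppage, off_t da);
    void read_page(unsigned long ppage, off_t da);
    off_t swapmap(struct proc *p, unsigned long vpn);

    /* Rough sketch of the page-in path described above. */
    void pagein(struct proc *p, unsigned long vpn) {
        unsigned long ppage = choose_victim();  /* 1. pick victim (p', vpn') */
        struct proc *pp;
        unsigned long vpnp;
        owner_of(ppage, &pp, &vpnp);

        write_page(ppage, swapmap(pp, vpnp));   /* 2. write victim to swap */
        pp->pagetable[vpnp] = INVALID;          /* 3. victim now faults */
        read_page(ppage, swapmap(p, vpn));      /* 4. read the wanted page */
        p->pagetable[vpn] = ppage;              /* 5. map it; retry insn */
    }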
Virtual Memory Policy
How is the victim page selected? There are several possible page replacement policies:
- First in, first out (FIFO), aka victim++
- Fair policy - no page gets to monopolize memory
- Problem: evicts commonly used pages just as often as uncommonly used pages
- Random replacement
- victim = random(physsize), to produce a value between 0 and the number of physical pages
- Same problem: evicts commonly and uncommonly used pages equally often
- Oracle policy
- Look ahead in future program execution and see which pages are needed in the future
- Victim = the page whose next use is farthest in the future
- The past is not considered in this policy
- Cannot be used by practical operating systems, as they cannot predict future behavior
- madvise
- System call which gives information to the kernel about future memory usage
- Syntax: int madvise(void *addr, size_t length, int advice)
- addr: start of the virtual address range
- length: number of bytes in memory to consider
- advice: one of several possible constants which describe the order in which memory will be accessed.
- Examples include MADV_SEQUENTIAL, MADV_RANDOM, and MADV_NORMAL; a full list can be found in the madvise(2) man page (see the usage sketch after this list)
- Has existed for a long time, but many apps don't use it
- Often doesn't help - following its advice can potentially make program execution worse
- Program can lie and claim to be more important / more highly used than it actually is, giving its pages unfair preference
- Least recently used (LRU)
- Choose the page that has not been used for the longest time to be the victim
- Most commonly used policy
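For reference, here is a minimal sketch of madvise in use, assuming a Linux system and an anonymous mapping:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1 << 20;  /* 1 MiB buffer */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* Tell the kernel we will scan the buffer sequentially, so it can
           read ahead aggressively and evict pages behind the scan. */
        if (madvise(buf, len, MADV_SEQUENTIAL) != 0)
            perror("madvise");  /* advice is best-effort; not fatal */

        for (size_t i = 0; i < len; i++)
            buf[i] = 0;         /* the sequential pass */
        munmap(buf, len);
        return 0;
    }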
Examples:
The same reference string is chosen to test all page replacement policies. For this section, we will assume there are 5 virtual pages (labeled 0-4) and 3 physical pages (labeled A-C), and we will use the following reference string:
0 1 2 3 0 1 4 0 1 2 3 4
In actuality, the reference string is determined by running various programs and keeping a log of pages used. The objective is to minimize overhead by reducing the number of page faults, which were marked with deltas in the tables worked through in lecture.
FIFO
Cost = 9 page faults
Oracle method
Cost = 7 page faults
Least recently used
Cost = 10 page faults (LRU usually beats FIFO; its loss here is an irregularity of this particular reference string)
FIFO, with 4 physical pages
Cost = 10 page faults. In this scenario, adding memory actually made performance worse. This is known as Bélády's anomaly.
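These counts can be checked mechanically. Below is a small simulator sketch for the FIFO and LRU policies on the reference string above; running it prints 9, 10, and 10 faults, reproducing the numbers (and Bélády's anomaly) from the tables:

    #include <stdio.h>

    #define NREFS 12

    /* Count page faults under FIFO ("victim++") replacement.
       Supports up to 8 physical pages (frames). */
    static int fifo_faults(const int *refs, int n, int nframes) {
        int frames[8], used = 0, victim = 0, faults = 0;
        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int j = 0; j < used; j++)
                if (frames[j] == refs[i]) { hit = 1; break; }
            if (hit) continue;
            faults++;
            if (used < nframes) {
                frames[used++] = refs[i];   /* fill empty frames first */
            } else {
                frames[victim] = refs[i];   /* evict oldest resident */
                victim = (victim + 1) % nframes;
            }
        }
        return faults;
    }

    /* Count page faults under LRU replacement. */
    static int lru_faults(const int *refs, int n, int nframes) {
        int frames[8], last[8], used = 0, faults = 0;
        for (int i = 0; i < n; i++) {
            int hit = -1;
            for (int j = 0; j < used; j++)
                if (frames[j] == refs[i]) { hit = j; break; }
            if (hit >= 0) { last[hit] = i; continue; }  /* refresh timestamp */
            faults++;
            if (used < nframes) {
                frames[used] = refs[i]; last[used++] = i;
            } else {
                int v = 0;                  /* evict least recently used */
                for (int j = 1; j < used; j++)
                    if (last[j] < last[v]) v = j;
                frames[v] = refs[i]; last[v] = i;
            }
        }
        return faults;
    }

    int main(void) {
        int refs[NREFS] = {0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4};
        printf("FIFO, 3 frames: %d faults\n", fifo_faults(refs, NREFS, 3));
        printf("FIFO, 4 frames: %d faults\n", fifo_faults(refs, NREFS, 4));
        printf("LRU,  3 frames: %d faults\n", lru_faults(refs, NREFS, 3));
        return 0;
    }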
Implementing LRU policy:
The ideal setup would be a dedicated field in each page table entry storing a timestamp of the last access. However, x86 does not support this, so a separate clock table array must be used. To let the kernel observe accesses, mark each page in the page table as inaccessible. When an access is attempted, a page fault results. In the fault handler, restore the page's proper read-write status so the access can proceed, and at the same time record the access time in the clock table entry.
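A rough sketch of this scheme, using hypothetical helper names (pte_set_prot, current_time), since the real kernel code is more involved:

    #include <sys/mman.h>

    #define NVPAGES 1024

    /* Hypothetical helpers: change a page's protection, read a clock. */
    void pte_set_prot(unsigned long vpn, int prot);
    unsigned long current_time(void);

    unsigned long clock_table[NVPAGES];  /* software last-access timestamps */

    /* Runs when a deliberately protected page is touched. */
    void lru_fault_handler(unsigned long vpn) {
        pte_set_prot(vpn, PROT_READ | PROT_WRITE);  /* let the access retry */
        clock_table[vpn] = current_time();          /* record the access time */
        /* Pages are periodically re-protected so future accesses fault
           again and keep the clock table approximately up to date. */
    }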
Virtual Memory Optimizations
Demand paging
- Rather than reading an entire program into physical memory, read only page 0, and copy additional disk pages into physical memory only if a page fault occurs due to an attempt to access them
- Pros: only loads the required pages - avoids unnecessary work when only small sections of a program are needed
- Example: gcc --version will only load the pages needed to print the version number, far fewer than the full program
- Cons: cannot batch reads, causing extra overhead (more page faults, extra disk arm movement to access data on disk in smaller amounts)
Dirty bit (and related concepts)
When a page fault occurs, the old page is written back to disk, and the replacement page is read into memory. The writing step can be redundant if the contents of RAM have not been modified and already match the disk contents. How can this be made more efficient?
- Read before writing
- Before writing to disk, directly compare the contents of the disk and RAM to determine whether the write needs to occur
- Inefficient - involves even more I/O than before (extra read)
- Use a checksum
- Collisions can occur - modified RAM data may have the same checksum as the disk contents, so the write would be skipped and the modification lost
- Solution: use a cryptographic checksum
- Regardless of accuracy, checksums are expensive - checksum table is proportional to size of virtual memory
- Dirty bit
- Most common optimization for this scenario
- Store an extra bit in each page table entry: 0 if the page has not been modified since it has been read from swap space, 1 if it has been modified
- Requires hardware cooperation, but it is not a performance problem, since the dirty bit changes very infrequently
- Can be implemented even on hardware without a designated dirty bit in the page table, though it requires kernel involvement. To do so, use a method similar to the LRU implementation:
- Mark each page in the page table as read-only
- When a page access is attempted, a page fault is triggered, allowing the kernel to set the dirty bit in a separate table
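A rough sketch of the corresponding write-fault path, again with hypothetical helper names:

    #include <sys/mman.h>

    #define NVPAGES 1024

    void pte_set_prot(unsigned long vpn, int prot);  /* hypothetical helper */

    unsigned char dirty[NVPAGES];  /* software dirty bits, one per page */

    /* Runs on the first write to a page that was mapped read-only. */
    void write_fault_handler(unsigned long vpn) {
        dirty[vpn] = 1;                             /* page now differs from swap */
        pte_set_prot(vpn, PROT_READ | PROT_WRITE);  /* let the write retry */
        /* When this page is later evicted, the kernel writes it to swap
           only if dirty[vpn] is set, then clears the bit and re-protects. */
    }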
fork() optimizations and vfork()
In more primitive implementations of fork(), the entire process is cloned, including the page table and the data - no memory is shared. This is very expensive and inefficient: when a large program forks in order to run a smaller one, the majority of the copied data is discarded as soon as the child calls close() and exec().
The solution is to copy only the page table and have the child's entries point to the same physical pages as the parent's. Both page tables are then marked read-only, which causes a page fault on any write attempt. On such a fault, the kernel copies just the modified page, updates the faulting process's page table to point at the copy, and restores read-write access. This copy-on-write scheme is how modern implementations of fork() work.
vfork is identical to fork, except for 2 differences:
- Parent and child share memory - vfork clones neither the page table nor the physical memory data
- Parent is frozen until child either execs or exits - this avoids race conditions where the parent and child are accessing the same memory
vfork() is no longer part of POSIX, because fork() has become more efficient - it copies pages only on write, and modern CPUs can clone the page table with a negligible effect on performance.
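Although obsolete, vfork's intended usage pattern is worth seeing; a minimal sketch:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = vfork();  /* child borrows the parent's memory */
        if (pid < 0) { perror("vfork"); return 1; }
        if (pid == 0) {
            /* The child must only exec or _exit; modifying memory here
               would clobber the suspended parent's state. */
            execlp("echo", "echo", "hello from the child", (char *)NULL);
            _exit(127);       /* reached only if exec failed */
        }
        /* Parent resumes here only after the child execs or exits. */
        waitpid(pid, NULL, 0);
        return 0;
    }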
Intro: Distributed Systems
Distributed systems involve running an OS whose components are not all on the same computer. One way of managing this is to use remote procedure calls (RPCs).
Remote procedure calls
RPCs differ from ordinary function calls and system calls in several ways:
- No shared memory
- The only way for the callee to access the caller's data is for that data to be copied
- Example: strcmp(a, b) will not work if one string is located on the caller's machine and the other on the callee's
- Pros: hard modularity is enforced
- Cons: difficult to access data
- No call by reference
- For instance, the system call ssize_t read(int fd, void *buf, size_t count) will not work as an RPC since buf is a memory location
- Caller and callee can have different CPUs or disagree on instruction sets / data formats (e.g. little vs. big endian, different CPU architecture)
RPCs can be presented as if they were ordinary function calls using the principle of marshalling, in which complex structures are flattened into a sequence of bytes and transmitted over the network. So that the original program can remain unmodified, an additional layer of functionality is inserted between the program and the network, which is responsible for marshalling. For instance, a read() call issued by the program would have its reference to a memory address removed, then be converted to a series of bytes encoding the call and its arguments. Likewise, an additional layer on the server side is needed to decipher the RPC and convert it back to an ordinary system call before it is passed to the server.
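As an illustration, here is a sketch of marshalling a read-like request; the wire format (opcode 1 = "read", fields in network byte order) and the function name marshal_read_request are invented for this example - real RPC systems generate such code from an interface description:

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    /* Marshal a read-like RPC request into buf; returns bytes written. */
    size_t marshal_read_request(unsigned char *buf, int32_t fd, uint32_t count) {
        uint32_t op = htonl(1);              /* which remote procedure */
        uint32_t nfd = htonl((uint32_t)fd);  /* caller's file descriptor */
        uint32_t ncount = htonl(count);      /* how many bytes to read */
        memcpy(buf + 0, &op, 4);
        memcpy(buf + 4, &nfd, 4);
        memcpy(buf + 8, &ncount, 4);
        return 12;  /* note: no buf pointer is sent - the data the server
                       reads comes back in the reply message instead */
    }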
Problems with RPCs:
- Messages can be lost or corrupted
- Network or server might be down, or slow
Beginning next lecture: how to fix these problems