Virtual Memory
By Avinash Kumar, Nithin Reddy, and Vineet Pandit.
One-level page table model:
· Maps virtual pages to pages of physical RAM.
· Some virtual pages may be empty (unmapped).
off_t swapmap(void *address, pte_t *p)
{
    /* Round the faulting address down to a page boundary and add the
       process's swap base to get this page's location on disk. */
    return p->swap + ((intptr_t) address & ~(pagesize - 1));
}
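For concreteness, assuming a 4 KiB page size: a fault at address 0x12345 is rounded down by the mask to the page boundary 0x12000, so swapmap returns p->swap + 0x12000, the start of that page's slot in the swap area.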
· void *address is the address of the faulting page.
· pte_t *p is the process table entry.
o This parameter is needed because two processes can page fault at the same time on the same machine; it gives the OS a way to find out where each process's virtual memory lives on disk.
· This routine should be called when there is a page fault.
· For example: there is a 0 (an invalid entry) in the page table entry.
· There is a routine that the kernel runs when there is a page fault:
void pfault(void *addr, pte_t *p)        /* pseudocode for the page-fault handler */
{
    off_t o = swapmap(addr, p);          /* find out where on disk this page lives */
    if (o == -FAULT) {                   /* page doesn't swap to anywhere; invalid page */
        kill(p, SIGSEGV);                /* bad process (just pseudocode) */
    } else {
        /* We want to copy the page from disk into RAM, but in actuality there is
           never any unused RAM. So we look through physical memory and pick a
           victim page that is in use but is not (and hopefully will not be) used
           in the near future (page_removal_policy). */
        victim_page, victim_process = page_removal_policy();   /* victim page owned by victim_process */
        pa = pmap(victim_page, victim_process);                 /* pa = physical page address */
        write page pa to disk @ swapmap(victim_page, victim_process);   /* save the old page to disk */
        read page pa from disk @ swapmap(addr, p);                       /* load the new page into memory */
        p->page_table[addr >> page_bits] = pa;                  /* update the faulting process's page table */
        victim_process->page_table[victim_page] = FAULT;        /* invalidate the victim's page table entry */
    }
}
· When the kernel traps on a page fault, the hardware saves the program counter of the instruction that faulted.
· In other words, on return from the interrupt the same instruction is retried.
· When the same instruction is run again, this time the page table lookup will work.
Q: Say one process faults. Should the entire kernel freeze and wait for the write and read to finish?
A: Of course not. While we're handling the page fault we shouldn't stay inside the kernel; we should let other processes run, because the read and the write are going to take a while. Even though the read and write look like subroutines, in actuality each one causes us to return from the interrupt and let somebody else run. When the write finishes we get another trap and go on to start the read; while we're reading we again let other processes run; then there's another trap and we update the page table entries.
· As we mentioned, two processes can have read-only access to the same page, and it's possible that they both page fault at the same time because they're reading a page that isn't there. In that case you have to be careful when you're looking for victim pages not to step on your own work. This makes your implementation more complicated; a sketch of one way to handle it follows.
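One way to avoid stepping on your own work is to mark a frame as "in transit" while its swap I/O is outstanding, so a concurrent fault handler skips it when choosing a victim. This is only a minimal sketch under that assumption; the frames[] table, its busy flag, and candidate_from_policy() are hypothetical names and differ from the (victim_page, victim_process) pair used in the pseudocode above.

#define NFRAMES 1024

/* Hypothetical frame bookkeeping, for illustration only. */
struct frame_info {
    int busy;                      /* nonzero while this frame is being swapped in or out */
};

struct frame_info frames[NFRAMES];

extern int candidate_from_policy(void);   /* assumed helper: FIFO/LRU suggestion */

/* Pick a victim frame, skipping frames whose swap I/O is still in flight. */
int pick_victim_frame(void)
{
    for (;;) {
        int v = candidate_from_policy();
        if (!frames[v].busy) {
            frames[v].busy = 1;    /* claim it before starting the disk I/O          */
            return v;              /* caller clears busy once the swap has completed */
        }
        /* Frame is mid-swap on behalf of another fault; ask for another candidate.
           A real kernel would block or retry under a lock rather than spin. */
    }
}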
The page_removal_policy
· How do we decide which page is the victim?
· Two basic properties of almost all real-world applications:
a) Spatial locality: accessing page i implies that you will soon access nearby pages (i+1, i-1, i+2, i-2, ...).
b) Temporal locality: if you access page i at time t, then you will most probably access the same page again at time t+1 (or soon after).
· In practice these two notions are also combined.
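For example, a simple loop that sums an array shows both properties: the sequential walk through a[] is spatial locality, and the repeated use of sum (and of the loop's own code) is temporal locality. A minimal illustration:

/* Summing an array: spatial locality in a[], temporal locality in sum and i. */
long sum_array(const int *a, long n)
{
    long sum = 0;                 /* reused on every iteration: temporal locality       */
    for (long i = 0; i < n; i++)
        sum += a[i];              /* a[i], a[i+1], ... are adjacent: spatial locality   */
    return sum;
}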
· Now we have to implement page_removal_policy.
· A terrible (but the simplest) algorithm is First In First Out (FIFO): whatever page got into the system first is the first to get paged out.
· The best policy would be to know the future and first remove the page that will not be used for the longest time (the "oracle" algorithm).
· Unfortunately, oracles aren't around in the real world.
Testing the page_removal_policy
· We run an application on a busy server and see what its reference string is.
o A reference string is the list of page numbers an actual application touches; we'll use reference strings to evaluate our page_removal_policy.
· Ideally we'll have a database of reference strings, each with an importance (weight) attached to it. We'll take these reference strings and try them out against our algorithm.
· We can collect reference strings from many applications: Oracle, Apache, MySQL, Firefox, and so on.
· Imagine a small reference string over a small virtual memory:
o 5 virtual pages (0 through 4), with reference string 0 1 2 3 0 1 4 0 1 2 3 4
o 3 pages of RAM
o Assume First In First Out (FIFO):
Page    | 0  | 1  | 2  | 3  | 0  | 1  | 4  | 0  | 1  | 2  | 3  | 4
Frame 1 | Δ0 | 0  | 0  | Δ3 | 3  | 3  | Δ4 | 4  | 4  | 4  | 4  | 4
Frame 2 | -  | Δ1 | 1  | 1  | Δ0 | 0  | 0  | 0  | 0  | Δ2 | 2  | 2
Frame 3 | -  | -  | Δ2 | 2  | 2  | Δ1 | 1  | 1  | 1  | 1  | Δ3 | 3
(Δ marks a page fault.)
This method results in 9 page faults (the cost of running the program).
· Ways to reduce page faults (make the program go faster):
· There are two basic ways to attack this problem.
§ Brute force: simply increase RAM.
§ If we had 5 pages of RAM, there would be only 5 page faults (one per distinct page).
§ Now try 4 pages of RAM:
Page    | 0  | 1  | 2  | 3  | 0  | 1  | 4  | 0  | 1  | 2  | 3  | 4
Frame 1 | Δ0 | 0  | 0  | 0  | 0  | 0  | Δ4 | 4  | 4  | 4  | Δ3 | 3
Frame 2 | -  | Δ1 | 1  | 1  | 1  | 1  | 1  | Δ0 | 0  | 0  | 0  | Δ4
Frame 3 | -  | -  | Δ2 | 2  | 2  | 2  | 2  | 2  | Δ1 | 1  | 1  | 1
Frame 4 | -  | -  | -  | Δ3 | 3  | 3  | 3  | 3  | 3  | Δ2 | 2  | 2
This results in 10 page faults.
3 pages of RAM meant 9 page faults, but with 4 pages there are 10 page faults: more RAM, yet a slower system.
This is Belady's anomaly: increasing memory (or cache) does not always help.
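As a sanity check, here is a small standalone simulator (not the kernel's page_removal_policy, just an illustrative sketch) that replays the reference string above under FIFO and counts the faults; with 3 frames it reports 9 faults and with 4 frames it reports 10, reproducing Belady's anomaly.

#include <stdio.h>
#include <string.h>

/* Count FIFO page faults for a reference string with 'nframes' frames. */
static int fifo_faults(const int *refs, int nrefs, int nframes)
{
    int frames[8];
    int next = 0, faults = 0;           /* next = slot holding the oldest page */
    memset(frames, -1, sizeof frames);  /* -1 means "frame empty" */

    for (int i = 0; i < nrefs; i++) {
        int hit = 0;
        for (int j = 0; j < nframes; j++)
            if (frames[j] == refs[i]) { hit = 1; break; }
        if (!hit) {
            frames[next] = refs[i];     /* evict the oldest page (FIFO) */
            next = (next + 1) % nframes;
            faults++;
        }
    }
    return faults;
}

int main(void)
{
    int refs[] = { 0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4 };
    int n = sizeof refs / sizeof refs[0];
    printf("3 frames: %d faults\n", fifo_faults(refs, n, 3));  /* prints 9  */
    printf("4 frames: %d faults\n", fifo_faults(refs, n, 4));  /* prints 10 */
    return 0;
}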
The best possible policy would be the "Oracle" algorithm: it looks into the future and discards the page that is not needed for the longest time. This algorithm is impossible to implement.
Using this algorithm on our reference string:
Page    | 0  | 1  | 2  | 3  | 0  | 1  | 4  | 0  | 1  | 2  | 3  | 4
Frame A | Δ0 | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | Δ2 | 2  | 2
Frame B | -  | Δ1 | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | Δ3 | 3
Frame C | -  | -  | Δ2 | Δ3 | 3  | 3  | Δ4 | 4  | 4  | 4  | 4  | 4
(The top row is the page number being requested; Δ marks a page fault.)
We get 7 page faults on the 3-page machine.
In the real world, knowing the future is not possible, so what we can do is approximate the future using the information we are given.
One of the most common algorithms for doing that is the Least Recently Used (LRU) algorithm: it assumes that a page not looked at in the recent past won't be used in the near future. In other words, it assumes temporal locality.
Using LRU on our reference string:
Page    | 0  | 1  | 2  | 3  | 0  | 1  | 4  | 0  | 1  | 2  | 3  | 4
Frame A | Δ0 | 0  | 0  | Δ3 | 3  | 3  | Δ4 | 4  | 4  | Δ2 | 2  | 2
Frame B | -  | Δ1 | 1  | 1  | Δ0 | 0  | 0  | 0  | 0  | 0  | Δ3 | 3
Frame C | -  | -  | Δ2 | 2  | 2  | Δ1 | 1  | 1  | 1  | 1  | 1  | Δ4
This gives 10 page faults on the 3-page machine.
For this particular reference string, LRU is actually worse than FIFO.
Implementing LRU:
We cannot implement LRU using only the data structures we already have in our page-fault code. There is not enough information in the page tables to tell which page was least recently accessed, so we have to add code to the kernel to record which page was least recently accessed.
How to record when pages are accessed:
You could use a real-time clock, storing a clock value (timestamp) in each page table entry.
Disadvantage: this makes each PTE grow by 32 bits (for a 32-bit clock), which is very expensive because PTEs take up RAM.
Instead of using a real-time clock, we could use a clock with fewer bits and have that clock tick once per page fault.
Example: use a very small, low-resolution clock (maybe a 4-bit clock), and ideally store it in the few spare bits of the page table entry.
Simple model: every time the clock ticks, increment every entry by 1; any page table entry with a value greater than 15 counts as old. Incrementing all the entries on every tick is slow.
Alternative model: instead, let the clock wrap around. The entry doesn't record how old the page is; it records the last time the page was accessed.
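A minimal sketch of this wrap-around, tick-per-fault clock, assuming the PTE has a few spare bits for a small timestamp (the field names and layout here are hypothetical):

/* Hypothetical PTE layout with a 4-bit software timestamp (illustrative only). */
struct pte {
    unsigned frame : 20;          /* physical frame number                  */
    unsigned valid : 1;           /* 1 if the page is resident              */
    unsigned stamp : 4;           /* "time" of the last fault on this page  */
};

static unsigned fault_clock;      /* ticks once per page fault, wraps at 16 */

/* Called from the page-fault handler for the faulting page. */
void record_fault_time(struct pte *pte)
{
    fault_clock = (fault_clock + 1) & 0xF;   /* small clock wraps around           */
    pte->stamp = fault_clock;                /* record when this page last faulted */
}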
Disadvantages:
This doesn't account for accesses that don't cause page faults, so it ends up ignoring most accesses.
Every page fault counts as an access, since the kernel is consulted on every page fault. But we don't want to count just the accesses that fault; we want to count the accesses that DIDN'T page fault. When a page is looked at and it's already in RAM, the kernel isn't consulted, so this scheme never sees that access.
So this is a terrible approximation of LRU.
There are two other approaches.
Software: the kernel traps for many reasons (hardware and clock interrupts); it traps every 1/50th of a second automatically. So on every trap, count accesses:
- from the kernel
- from applications
Hardware: every time a page is written through a page table entry, the hardware sets a bit in that entry. That is, have a single bit in the page table entry called the dirty bit; the hardware maintains the dirty bit by setting it every time the page is written.
a. If the dirty bit is not maintained by the hardware, then mark clean pages as read-only; a write then causes a fault, and the page-fault handler sets the (software) dirty bit, marks the page read-write, and returns. A sketch of this trick follows.
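A minimal sketch of that trick; set_page_protection() and the soft_dirty array are hypothetical helpers invented for illustration, not a real kernel API:

#define NPAGES 1024

enum prot { PROT_READ_ONLY, PROT_READ_WRITE };

extern void set_page_protection(int page, enum prot p);  /* assumed helper */

static int soft_dirty[NPAGES];    /* software dirty bits, one per page */

/* Called on a write fault to a page that is logically writable. */
void write_fault(int page)
{
    soft_dirty[page] = 1;                        /* the page is now dirty                  */
    set_page_protection(page, PROT_READ_WRITE);  /* let further writes proceed fault-free  */
    /* Return from the fault; the faulting write is retried and now succeeds. */
}

/* When the page is written back to disk (cleaned), re-arm the trap. */
void mark_clean(int page)
{
    soft_dirty[page] = 0;
    set_page_protection(page, PROT_READ_ONLY);   /* the next write will fault again */
}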
How to tell if a page has been modified:
1. Keep a copy of the page in RAM and compare against it. (This works, but it is too expensive.)
2. Keep a checksum (say, MD5) of the page in the PTE: if it still matches, the page hasn't changed, so move on.
Problems:
a. Chance of collision: two different blocks of memory can hash to the same value, and we would lose data.
b. The checksumming process is slow (it has to walk through the whole page), which consumes CPU time.
3. Instead, have a single bit in the page table entry called the dirty bit: it stays cleared if the page has never been changed and is set when the page is changed.
*************** break ***************
Assume:
C = 10 ms to read a random page
F = 0.1 ms of CPU overhead to handle a page fault
N = 10,000 pages in the program
U = 1,000 pages actually used (U <= N)
Assume no thrashing.
Which is faster:
1. No demand paging: read all N pages of the program into RAM, then execute it (no page faults).
2. Demand paging: read the first page into RAM and start running, taking page faults as needed.
                   | Total cost | Latency (time to start the program)
No demand paging   | NC         | NC
With demand paging | U(C + F)   | C + F
With demand paging, pages are faulted in as needed.
Latency: the amount of time it takes to start up the program (for example, the time from the user clicking the icon until the splash screen appears).
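Plugging in the numbers above: without demand paging the total cost (and the latency) is NC = 10,000 × 10 ms = 100 s. With demand paging the total cost is U(C + F) = 1,000 × 10.1 ms ≈ 10.1 s, and the latency drops to C + F = 10.1 ms.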
********* SEE COMMON OPTIMIZATIONS, Spring '09 *********
Reading the program into RAM: assume there is no thrashing. If we read all N pages into RAM we are finished, with no page faults afterwards.
The other option is demand paging: don't read the whole program into RAM up front.
Sometimes the total cost is greater with demand paging (if U == N, demand paging is slower because of the fault overhead).
In another sense we cheated by using the same cost C for every page read: on disk it is much cheaper to read consecutive pages than the same number of random pages, because consecutive pages are contiguous. These numbers are fair for flash, but on disk the per-page cost of the big sequential read is considerably smaller, so sometimes ordinary (non-demand) paging will win.
fork vs. vfork
We know about fork.
vfork:
A. The child runs and the parent doesn't (the parent is frozen, and can't even do a wait) until the child either:
1. exits, or
2. does an exec (i.e., the child starts running some other program).
B. The child and parent share all their memory until 1 or 2 occurs.
- So no race conditions occur due to this shared memory, because the child and parent can't possibly run at the same time.
What's the point? Why vfork?
vfork is only good if you want to do an exec, or a few other things such as closing things (e.g., pipes) in the child.
Because the child and parent share memory, they share page tables, so we don't need to copy all the memory into the clone. vfork is a cheap fork because of the existence of virtual memory.
You can do everything with vfork that you do with fork, except that you don't have to copy page tables, so it runs FASTER.
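A minimal sketch of the idiomatic vfork pattern: the child only execs (or _exits), so the parent can safely stay frozen on the shared memory until then. (On most systems the child must call only an exec function or _exit after vfork.)

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = vfork();          /* parent is suspended until the child execs or exits */
    if (pid < 0) {
        perror("vfork");
        return 1;
    }
    if (pid == 0) {
        /* Child: no page tables were copied; just exec the new program. */
        execlp("ls", "ls", "-l", (char *)0);
        _exit(127);               /* exec failed; the child must use _exit, not exit */
    }
    /* Parent only gets here after the child has exec'd or exited. */
    int status;
    waitpid(pid, &status, 0);
    return 0;
}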
Forks in general:
Nowadays people don't copy the RAM. Instead, fork clones the page tables and marks the pages read-only and copy-on-write, which copies a page only when the child (or parent) writes to it.
So, to make fork go faster on modern machines: mark the copies read-only and copy-on-write. When a process tries to write into those pages, it faults, and the fault handler makes the private copy.
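A rough, purely illustrative sketch of that copy-on-write write-fault path; the process structure, frame_refcount(), allocate_frame(), copy_frame(), and frame_decref() are all hypothetical names, not from any particular kernel:

/* Hypothetical copy-on-write handling of a write fault, for illustration only. */
void cow_write_fault(struct process *p, int vpage)
{
    int old_frame = p->page_table[vpage].frame;

    if (frame_refcount(old_frame) > 1) {           /* page still shared with the other process */
        int new_frame = allocate_frame();          /* grab a fresh physical page               */
        copy_frame(new_frame, old_frame);          /* duplicate the shared page's contents     */
        frame_decref(old_frame);                   /* the old copy has one fewer user          */
        p->page_table[vpage].frame = new_frame;    /* point this process at its private copy   */
    }
    p->page_table[vpage].writable = 1;             /* writes may now proceed                   */
    /* Return from the fault; the write is retried and succeeds. */
}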
For example, Emacs uses vfork.
This is the reason for the fork vs. vfork controversy: we have traded complexity for speed.
Distributed Systems
You can do virtual memory with no disk, paging "over the net".
This works, but paging is more expensive: the cost of a page fault is higher.
To support this in a distributed system you need something like:
read_over_the_net() syscall
write_over_the_net() syscall
Remote Procedure Call (RPC)
The caller is on machine A; the callee is on machine B (the machine with the disk).
It is like an ordinary function call, except:
1. It gives hard modularity.
2. There is no call by reference, only call by value (the caller and callee do not share an address space).
3. The caller could be x86 and the callee could be MIPS: they don't need the same instruction set.
Disadvantage:
1. Performance downside: it is slow.
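To make the call-by-value point concrete, here is a tiny, purely hypothetical client stub: the argument is copied (marshalled) into a message sent to machine B, and the result is copied back; no pointer into the caller's address space ever crosses the wire. The message format and net_send()/net_recv() helpers are assumptions, not a real RPC library.

/* Hypothetical RPC client stub illustrating call by value (no real library). */
struct rpc_msg {
    int opcode;                  /* which remote procedure to run           */
    int arg;                     /* the argument is copied into the message */
};

extern void net_send(const void *buf, int len);   /* assumed transport helpers */
extern void net_recv(void *buf, int len);

/* Looks like an ordinary function call, but the work happens on machine B. */
int remote_square(int x)
{
    struct rpc_msg request = { 1, x };    /* marshal the argument by value        */
    int result;

    net_send(&request, sizeof request);  /* ship the copy to the callee          */
    net_recv(&result, sizeof result);    /* copy the result back; nothing shared */
    return result;
}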