Lecture 14: Scribe Notes for May 21, 2013
By Rhee, Min Hyoung
----------------------------------------------------------------------------------------------------------
Table of Content
1. What happen if system crash and reboot during write?
1.1 What happen if system crash and reboot in the process of writing to directories?
2. Proposed POXIS Model
2.1 Commit Record
2.2 Journals
2.3 Write-ahead log
2.4 What happen if system crash during recovery procedure like fsck?
2.5
Radical proposal: log structured file system
2.4 Write-behind log
3. Unreliable processes containing bad memory references
3.1 Base and Bounds: two new registers
3.2 Segments: multiple base-bounds pairs
3.3 Pages
3.4 Page Table
-----------------------------------------------------------------------
What happen if system crash and reboot during write?
Example:
write(fd,buf,bufsize) where bufsize=10,000,0000
Let's say we are writing blocks A,B,C,D...and so on.
Below is example of what memory looks like after executing writes.
DEFGH |
ABC |
MNOP |
IJKL |
However what happen if system crash in the middle of writing data.
Possible example of how memory looks like after system crash during the write: |
ABC |
MNOP |
IJKL |
Therefore, if system crash and reboot during writing data, there is possibility of writing incomplete data. (i.e. missing some data)
What happen if system crash and reboot in the process of writing to directories?
Example: Let's say we have file name "a" and "b". We want to rename "a" into "b", which will remove old "b".
rename("a", "b")
"a"; 27 |
"b"; 65 |
0 |
"b"; 27 |
"b"; 27 |
0 |
0 |
"b"; 65 |
"a":27 |
"b"; 27 |
Proposed POXIS Model
RAM in memory cache of:
regular files: blocks of a given size.
file attributes: individual files
directories: directory entries
syscalls:
fsync(fd)-synchronizes data + attribute(like date last modified)
fdatasync(fd)-cheaper, jusy synchronizes data (for performance, we accept some incorrect information in data's attribute)
However both of them are slow since disk arm take 'ms' to move.
Ideas for atomic rename()
1) COMMIT RECORD - seperate spot on file system that tell rename is completed or not.
if not commited, it will do fsck and pretend rename never happen.
if committed but system crash after, it will do some clean up for rename.
rename(a,b)
BEGIN
pre-commit phase (some possible problem: name is too long and disk does not have enoufh space for it, do not have permission)
COMMIT(atomic, which mena done or not done) OR abort (if problem occured in pre-commit phase)
post-commit phase (clean up, optimization)
END
2) Journal
two parts in our file system.
one is cell data that contain just data that looks like this:
Cell 0 |
|||||||||
Cell 10 |
Cell 13 |
||||||||
Cell 20 |
|||||||||
Cell 30 |
Cell 37 |
||||||||
Cell 40 |
|||||||||
Cell 50 |
|||||||||
Cell 60 |
|||||||||
Cell 70 |
|||||||||
Cell 80 |
|||||||||
Cell 90 |
other part is journal (in simplest model) has properties:
Write-ahead log: one way to use journal
log the change you want to make:
log commit
write the data
log END
Example:
BEGIN
13="B"
37="0"
COMMIT <--------------most delicate place for system crash is between COMMIT and END.
END
If system crash occured between COMMIT and END, log will look like this:
BEGIN
13="B"
37="0"
COMMIT
Recovery procedure will replay the log left to right (in this case top to bottom) and when it see
COMMIT but no END, it will just proceed END anyway.
What happen if system crash during recovery procedure like fsck?
In order to avoid this problem, we should recover using idempotent operations
(idempotent operations mean doing same operations twice will be same as doing it only once).
Then, we can just run fsck again when system crash during recovery procedure.
However, in some complex case, we cannot make it idempotent. In case like this, we can put what fsck will do into log, and
if system crash, we look at log to know where to start.
*Note that journal is overhead and cost us some performance issue.
Radical proposal: log structured file system
pro(+): writes are really fast because no seeking neccessary! (because we are only writing journal and journal is appended).
con(-):reads can take while (if data is not in cache because disk arm need to move from journal to data cell).
Write-behind log
13: "A"->"B"
37: "B" -> 0
BEGIN
13:"A"
37:"B"
now make the changes to cells
COMMIT
END
If system crash occured, and when fsck does not see COMMIT, then it replays log backward replacing values to old values.
example: research file system
120PB, 200,000 hard drives(each 600GB, so faster drives and consume less power than one 120PB hard drives)
GPFS (General Parallel File System) - one file systeam handling 200,000 hard drives will be overkill, so we use multiple file system instead.
Next Problem: Unreliable processes containing bad memory references
For example:
a[i] = 27
Here, what if a[i] is out of range?
Solutions
Instead we have better idea. we get help from hardware not by adding new instruction,
but by adding two new registers, base register and bounds register that contain base value and bounds value.
Base and Bounds: two new registers
RAM:
base: address of 1st byte you want system to be able to access.
bounds:address of 1st byte you don't want system to be able to access.
hardware checks all memory access to fit in here - if check fails, hardware will trap.
This is not as good as SUBSCRIPT instruction because arrays in same process can still mass with each other.
Example: what happen if we use abolute address in machine code?
sort a > aout
sort b > bout
so there cannot be absolute address using this method.
Shared memory:
We can use shared memory but it is very limited (not flexible).
Segments: multiple base-bounds pairs
Segment #: 8 bits |
Segment offset: 24 bits |
2^8 = 256 pairs of base and bounds
requires hardware supports.
each process has its own segment table.
segments have varying size (so we have holes between segments).
Using segments solve problems of
However, segments still have problems of fragmentation problem.
Pages
Pages - like segments except all pages are same size (typically 4KiB)
Page #: 20 bits |
Page offset: 12 bits
|
We can grow domain by simply adding another pages.
Page table
page table: info about page
*Note that modifying page table is previlege instruction.
*Also system need to make sure physical address in page table does not point to
address
that is allocated for page table because then user can modify page table without using previlege instruction.