Lecture 14: Scribe Notes for May 21, 2013

Lecture 14: Scribe Notes for May 21, 2013
By Rhee, Min Hyoung

----------------------------------------------------------------------------------------------------------

Table of Content

1. What happen if system crash and reboot during write?
1.1 What happen if system crash and reboot in the process of writing to directories?

2. Proposed POXIS Model
2.1 Commit Record
2.2 Journals
2.3 Write-ahead log
2.4 What happen if system crash during recovery procedure like fsck?
2.5 Radical proposal: log structured file system
2.4 Write-behind log

3. Unreliable processes containing bad memory references
3.1 Base and Bounds: two new registers
3.2 Segments: multiple base-bounds pairs
3.3 Pages
3.4 Page Table

-----------------------------------------------------------------------

What happen if system crash and reboot during write?

Example: write(fd,buf,bufsize) where bufsize=10,000,0000
Let's say we are writing blocks A,B,C,D...and so on.
Below is example of what memory looks like after executing writes.

Memory after writes:

DEFGH

ABC

MNOP

IJKL

However what happen if system crash in the middle of writing data.

Possible example of how memory looks like after system crash during the write:

ABC

MNOP

IJKL

Due to system crash during write, some of the data were not written to the memory.

Therefore, if system crash and reboot during writing data, there is possibility of writing incomplete data. (i.e. missing some data)

What happen if system crash and reboot in the process of writing to directories?

Example: Let's say we have file name "a" and "b". We want to rename "a" into "b", which will remove old "b".
rename("a", "b")

Directories Before rename("a", "b"):

"a"; 27

"b"; 65

Directories After rename("a", "b"): one possibility

"b"; 27

Directories Before rename("a", "b"): another possibility

"b"; 27

Note in both cases, rename require two writes: one for "a" and other for "b".
However, what happen if system crash between two writes?

Case 1: Directories after write to "a", system crash and reboot (before write for "b" execute).

"b"; 65

As a result of rename+crash, we just lost file "a", which is same as unlink("a").

Case 2: Directories after write to "b", system crash and reboot (before write for "a" execute).

"a":27

"b"; 27

This time we have two directories, pointing to same inode.
This is same as unlink("b") + link("a","b"), but we have wrong link count.
(Note: we are not allow to just link("a","b") when "b" already exist so we need to unlink("b") first )
Link count should be 2 but in this case, link count is still 1 and it will cause memory curruption.
However, this is less serious problem compare to losing file in case 1.
Therefore, we let fsck take care of this problem (fsck checks file system at reboot).
This solution is not really efficient since many people do not like to fsck because it is slow process.

Proposed POXIS Model

RAM in memory cache of:
                       regular files: blocks of a given size.
                         file attributes: individual files
                         directories: directory entries

syscalls:
fsync(fd)-synchronizes data + attribute(like date last modified)
fdatasync(fd)-cheaper, jusy synchronizes data (for performance, we accept some incorrect information in data's attribute)
However both of them are slow since disk arm take 'ms' to move.

Ideas for atomic rename()

1) COMMIT RECORD - seperate spot on file system that tell rename is completed or not.
if not commited, it will do fsck and pretend rename never happen.
if committed but system crash after, it will do some clean up for rename.

rename(a,b)
BEGIN
   pre-commit phase (some possible problem: name is too long and disk does not have enoufh space for it, do not have permission)
    COMMIT(atomic, which mena done or not done) OR abort (if problem occured in pre-commit phase)
   post-commit phase (clean up, optimization)
END

2) Journal
two parts in our file system.
one is cell data that contain just data that looks like this:

Cell 0
Cell 10	Cell 13
Cell 20
Cell 30		Cell 37
Cell 40
Cell 50
Cell 60
Cell 70
Cell 80
Cell 90

other part is journal (in simplest model) has properties:

it is infinitely long.
append only storage - only allow writing data to the end, no modification is possible.
random access(for reads)

Write-ahead log: one way to use journal

log the change you want to make:
log commit
write the data
log END

Example:
BEGIN
13="B"
37="0"
COMMIT   <--------------most delicate place for system crash is between COMMIT and END.
END

If system crash occured between COMMIT and END, log will look like this:

BEGIN
13="B"
37="0"
COMMIT

Recovery procedure will replay the log left to right (in this case top to bottom) and when it see
COMMIT but no END, it will just proceed END anyway.

What happen if system crash during recovery procedure like fsck?

In order to avoid this problem, we should recover using idempotent operations
(idempotent operations mean doing same operations twice will be same as doing it only once).
Then, we can just run fsck again when system crash during recovery procedure.
However, in some complex case, we cannot make it idempotent. In case like this, we can put what fsck will do into log, and
if system crash, we look at log to know where to start.
*Note that journal is overhead and cost us some performance issue.

Radical proposal: log structured file system

keep journals on disk - so we only writes to journal
put cell data cache in RAM - cell data is too big to put into cache, so we only bring neccessary cell from disk to RAM.

pro(+): writes are really fast because no seeking neccessary! (because we are only writing journal and journal is appended).
con(-):reads can take while (if data is not in cache because disk arm need to move from journal to data cell).

Write-behind log

13: "A"->"B"
37: "B" -> 0

BEGIN
13:"A"
37:"B"
now make the changes to cells
COMMIT
END

If system crash occured, and when fsck does not see COMMIT, then it replays log backward replacing values to old values.

example: research file system
120PB, 200,000 hard drives(each 600GB, so faster drives and consume less power than one 120PB hard drives)
GPFS (General Parallel File System) - one file systeam handling 200,000 hard drives will be overkill, so we use multiple file system instead.

distributed metadata (good performance)
efficient directory indexing
distributed locking - allow more parallelism
partition awareness
system stays live during fsck - fsck will take way too long and we do not want systeam to be unable to use for long time.

Next Problem: Unreliable processes containing bad memory references

For example:
a[i] = 27
Here, what if a[i] is out of range?

Solutions

hire better programmer - does not work! (most, and probably all programmers make mistake sometimes)
tell compiler to insert subscript checks (software check) - cause performance problem due to overhead.
get help from hardware - add SUBSCRIPT instruction to hardware (however, even this is too slow).

Instead we have better idea. we get help from hardware not by adding new instruction,
but by adding two new registers, base register and bounds register that contain base value and bounds value.

Base and Bounds: two new registers

RAM:

base: address of 1st byte you want system to be able to access.
bounds:address of 1st byte you don't want system to be able to access.

hardware checks all memory access to fit in here - if check fails, hardware will trap.

This is not as good as SUBSCRIPT instruction because arrays in same process can still mass with each other.

you must know in advance how much memory you need (you cannot grow memory for a process but you can shrink.)
fragmentation
cannot use absolute addresses in machine code - use position independent code (PIC)
sometimes we want shared memory

Example: what happen if we use abolute address in machine code?
sort a > aout
sort b > bout

so there cannot be absolute address using this method.

Shared memory:
We can use shared memory but it is very limited (not flexible).

Segments: multiple base-bounds pairs

address (for 32 bits)

Segment #: 8 bits

Segment offset: 24 bits

2^8 = 256 pairs of base and bounds
requires hardware supports.
each process has its own segment table.
segments have varying size (so we have holes between segments).

Using segments solve problems of

pre-determining how much memory will be needed
no absolute address in machine code
non-flexible shared memory.

However, segments still have problems of fragmentation problem.

Pages

Pages - like segments except all pages are same size (typically 4KiB)

virtual address (for 32 bits)

Page #: 20 bits

Page offset: 12 bits

We can grow domain by simply adding another pages.

Page table

page table: info about page

*Note that modifying page table is previlege instruction.
*Also system need to make sure physical address in page table does not point to address
that is allocated for page table because then user can modify page table without using previlege instruction.