CS 111 Lecture 14

Notes by Jun Zhou

Continued from last time...Bug in gzip:

http://bugs.gnu.org/22768

gzip foo
fd = open("foo.gz", O_CREAT ...);
write(...);
close(fd);
unlink("foo");
<--- If the system crashes here, the disk scheduler may have written out the block for unlink() without having written out the data from the earlier write(): foo is gone and foo.gz has no data.

Fix:

dfd = open(".", O_SEARCH);
fd = openat(dfd, "foo.gz", O_CREAT | ...);
write(...);
fsync(fd); //saves the written data; slow
fdatasync(dfd); //saves the creation of foo.gz
close(fd);
unlinkat(dfd, "foo", 0);

run with: gzip --synchronous

Argument against this fix:

The bug occurs with any app that moves (not copies) data, so we need to fix not only gzip but also mv, etc.
mv --synchronous

Arguably this is a problem with the file system/scheduler, so the fix does not belong in gzip at all.

Sidenote:

2 big ideas for "reliable" file systems:

  1. Commit record

    a. assume individual blocks/sectors can be written atomically (1 sector atomicity)

    b. then collect several writes s1, s2, ..., sn, sn+1, where sn+1 is the commit record. The previous writes are tentative until sn+1 is written, which yields multi-sector atomicity

  2. Journal

    Divide the file system into 2 pieces: cell memory and log. Cell memory is an array of blocks of the file system. The log is an append-only tape recording the state of the file system.

    For any big action:
    Begin
     log intended changes (pre-commit)
    Commit/Abort
     copy changes from log to cell data (post-commit)
    End

    The commit record ('ok') is itself written into the log.
    Example log:   beg | a: A | b: B | ok | end
    Copy from the RAM cache to cell memory only after 'ok'. This ensures atomicity: if the system crashes, then during reboot, scan the log for any action that has a 'beg' and an 'ok' but no 'end', and replay it.
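The reboot-time replay can be sketched in C against an in-memory model of the log format above (beg | a: A | b: B | ok | end). The record structure and tag names are hypothetical; a real journal stores these as disk blocks.

```c
#include <string.h>

/* One journal record: tag is "beg", "set", "ok", or "end";
 * a "set" record says "cell[idx] should become val". */
struct rec {
    const char *tag;
    int idx;
    char val;
};

/* Recovery: if the log holds 'beg' and 'ok' but no 'end', the action
 * committed but its changes may not have reached cell memory, so copy
 * the logged changes in. Otherwise leave cell memory alone. */
void replay(const struct rec *log, int n, char cell[]) {
    int saw_ok = 0, saw_end = 0;
    for (int i = 0; i < n; i++) {
        if (strcmp(log[i].tag, "ok") == 0)
            saw_ok = 1;
        if (strcmp(log[i].tag, "end") == 0)
            saw_end = 1;
    }
    if (saw_ok && !saw_end)            /* committed but unfinished */
        for (int i = 0; i < n; i++)
            if (strcmp(log[i].tag, "set") == 0)
                cell[log[i].idx] = log[i].val;
}
```

An action that never reached 'ok' is simply ignored, which is exactly the abort case: the tentative writes sit harmlessly in the log.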

Problems:
  1. The log is a circular buffer. What if we run out of space on the log?
    One option: if the log is full, start writing out to cell memory.
    A more extreme approach: no cell data. Use a file system with all log and no data. During crash and reboot, go through the entire log and build a model of file system from scratch. Once the log is full, remove obsolete entries.
    => This method is efficient for systems and apps with lots of small writes because it keeps all writes in the same spot, requiring no seeks, just writes
  2. Recovery involves a scan back to see how far back to replay actions, then a scan forward to replay. What we've discussed previously is called a 'write-ahead log'. A possible solution is to use a 'write-behind log':
    Begin
     log old values
     new values into cell data
    Commit
    End
    => This enables faster recovery because you just restore old values: undoing via the log is one backward pass instead of two passes. Also, fewer sectors are written, because commit and end coincide, so no separate 'end' record is needed.
    Problem: to get old values, we need to read from disk, which can be a waste of effort. If we need the old data anyway, then this is okay. Otherwise, if we are just writing, then the old data is not needed and this is a waste.
    => In general, a write-ahead log is better for file systems, write-behind is better for database systems.
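The write-behind scheme can be sketched the same way; again the structure and names are invented for illustration. The key difference from write-ahead: the log records OLD values, new values go straight to cell memory, and recovery for an uncommitted action is a single backward pass restoring the old values.

```c
/* In-memory model of a write-behind log. */
enum { NCELL = 26, MAXLOG = 16 };

struct undo { int idx; char old; };

struct wb {
    char cell[NCELL];        /* cell memory */
    struct undo log[MAXLOG]; /* old values, logged before overwriting */
    int nlog;
    int committed;
};

/* Log the old value first, then update cell data in place. */
void wb_write(struct wb *w, int idx, char val) {
    w->log[w->nlog].idx = idx;
    w->log[w->nlog].old = w->cell[idx];
    w->nlog++;
    w->cell[idx] = val;
}

/* After a crash: if the action never committed, walk the log
 * backward, restoring old values. One pass; nothing to replay. */
void wb_recover(struct wb *w) {
    if (!w->committed)
        for (int i = w->nlog - 1; i >= 0; i--)
            w->cell[w->log[i].idx] = w->log[i].old;
}
```

Note that wb_write must read the old value before overwriting it, which is the disk-read cost the notes point out: wasted effort unless the old data was needed anyway.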

A More Complex Idea:

Atomicity becomes a problem when actions get big enough. A more complex idea is to let users see inside a pending action and observe which phases of it have finished, rather than insisting on all-or-nothing.

2 Techniques:

  1. Cascading aborts: Caller should fail if callee doesn't commit. Apply this at multiple levels.
  2. Compensating action: Caller fixes low-level problem some other way and then continues.

When does each technique work better?
Begin
 cascading abort -----> (callee's action failed, caller should also report failure)
Commit
 compensating action------> (commit already done, no need to abort, just use compensating action)
End

Also, recovery should be robust in the presence of crashes. One way is to make it 'idempotent':
recover(recover(x)) = recover(x)
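One concrete reason journal replay is idempotent: records carry absolute new values ("a: A"), not deltas, so applying a record twice leaves the same state as applying it once. If the log held increments instead, a recovery that itself crashed and reran would corrupt the data. A minimal sketch, with hypothetical record arrays:

```c
/* Replay records that store absolute new values: cell[idx[i]] = val[i].
 * Running this twice is the same as running it once, so recovery can
 * crash partway through and simply be rerun from the start. */
void recover_log(char cell[], const int idx[], const char val[], int n) {
    for (int i = 0; i < n; i++)
        cell[idx[i]] = val[i];   /* absolute value: safe to repeat */
}
```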

Can we partition data into "must be persistent" vs "ok to lose"?

Yes, on the same machine, use multiple filesystems, run 1 in safe mode and the other not.
Q: how to support namei in this case?
Ex: namei /usr/bin/sh
walks through the file system, looking up each component (usr, then bin, then sh) in turn
Problem:
/usr/bin/sh is in 1 filesystem (ext4)
/home/eggert/catvideos is in another filesystem (vfat)
A. In Linux, there is another data structure called the 'mount table', which maps (dev_t, ino_t) - the device and inode number of a mount point - to the dev_t of the filesystem mounted there.
So namei actually identifies each file by a (dev, inode #) pair and checks the mount table at each step to see whether it has crossed into another filesystem
=> object oriented, where fs type determines which method to use for read, write, etc.
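The mount-table lookup at each step of the path walk can be sketched like this; the structure and function names are hypothetical, and the real kernel structures carry much more state.

```c
#include <sys/types.h>

/* One mount-table entry: the inode (on_dev, on_ino) is a mount point,
 * and the filesystem living on mounted_dev is mounted there. */
struct mount_entry {
    dev_t on_dev;
    ino_t on_ino;
    dev_t mounted_dev;
};

/* At each step, namei asks: is this (dev, ino) a mount point?
 * If so, continue the walk inside the mounted filesystem. */
dev_t cross_mounts(const struct mount_entry *tab, int n,
                   dev_t dev, ino_t ino) {
    for (int i = 0; i < n; i++)
        if (tab[i].on_dev == dev && tab[i].on_ino == ino)
            return tab[i].mounted_dev;   /* descend into mounted fs */
    return dev;                          /* not a mount point */
}
```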

...Next Problem:

3. unreliable programs with invalid memory references

Solutions:
  1. Hire perfect programmers
  2. Use Java, Python, JavaScript (software checks every access), but this is slow
  3. Hardware help

Hardware Help:

  1. Use 'base and bounds' (which are privileged registers marking start and end of your program in memory).
    Every memory access must check that it is within base and bound of the program.
    Problem: fragmentation and inflexibility, because this requires that program memory be contiguous
  2. Segmentation
    Divide memory address into 2 pieces: seg# and offset within segment.
    Ex: an 8-bit seg# and a 24-bit offset in a 32-bit address. A segment table maps each seg# to its base and bounds.
    Or, divide data into different areas such as stack segment, text segment (for code), etc. This makes it possible to grow the stack or move, then adjust base and bound values.
    Problem: there can still be fragmentation
    Solution...
  3. Segment table with base only and no bound (insist that every segment is the same size)
    => Basically, use pages with 'page based memory' where a virtual address contains a page number and offset byte# into the page.
http://denninginstitute.com/itcore/virtualmemory/images/Translation.gif
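Splitting a virtual address into page number and offset is just shifting and masking. A sketch assuming 32-bit addresses and 4 KiB pages (a 12-bit offset); the constants are illustrative, not fixed by any particular machine.

```c
#include <stdint.h>

enum { OFFSET_BITS = 12 };  /* 4 KiB pages */

/* High bits select the page-table entry... */
uint32_t page_number(uint32_t va) {
    return va >> OFFSET_BITS;
}

/* ...low bits index the byte within the page. */
uint32_t page_offset(uint32_t va) {
    return va & ((1u << OFFSET_BITS) - 1);
}
```

The hardware does exactly this on every access: page_number indexes the page table to find a physical frame, and page_offset is carried over unchanged.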

This brings us to the topic of virtual memory...

Uses for Virtual Memory:

  1. Run programs that need 16 GiB on computers with only 8 GiB of RAM.
    How? If the program accesses a part of VM not in physical memory, take a page fault and bring the data in from disk. But this is very slow if memory accesses are random; it is okay if they are mostly sequential. In general, this is not a good reason to use VM.
  2. Programs can share memory safely
    Ex: syscall sendmsg(dest proc, message) where message is a ptr to a buffer
    This can be implemented by copying the page-table entry for the message data from one process's page table to the other's - no copy of the data itself
    => safe and efficient interprocess communication, because processes can only look at data their page table points to
    Ex: malloc can be implemented via a call to mmap, a syscall by which a process can control its own page table. The app asks the kernel to modify its page table.
    Args:
    1. fd to mmap from
    2. offset in file
    3. size of region
    4. VM address //where you want the section to appear in VM
    Advantages: can allocate just enough (say, exactly 293617000 bytes); no fragmentation

RAM can be seen as a cache of swap space (the pages your program has, on disk); the page table records which pages are currently resident.
Q: With 8 GiB RAM, how much swap space should you have? If swap space greatly exceeds RAM, you can run programs a lot bigger than RAM but this is slow.
A: next lecture...