CS111: Scribe Notes for May 22, 2013

CS111: Scribes Notes for May 22, 2013

Scribes: Derek Tsay and Abby Sywe

Contents

Non-atomic Write Problem

Proposed POSIX Model

Ideas for Atomic System Calls

Example File System/GPFS

Unreliable Processes & Bad Memory References

Base/Bounds Registers

Segments

Pages

Non-atomic Write Problem

What if we run "write(fd, buf, bufsize)" and a crash and reboot happens before the write finishes (bufsize = 1,000,000)? Recall that in the Berkeley file system, buf is put on disk as a series of blocks that are not necessarily continuous. Writing "ABCDE..." may appear as follows:

The problem with this is that the system can optimize the write, so it doesn't have to write from left to right. However, the problem gets worse. So far we are only talking about the contents of files, but what about directories? There are two possible scenarios of running "rename("a","b")":

In scenario 1, we write alpha, then we write beta. If a crash occurs between the writes, we lose the data for a. The result is this:

This result is equivalent to running "unlink("a")".

In scenario 2, we write beta, then we write alpha. If a crash occurs between the writes, the result is this:

This result is equivalent to running "unlink("b"); link("a", "b")", but we also get the wrong link count (we created a new inode for "b", but never incremented the link count). This issue can be fixed by FSCK, which will make sure the link count corresponds to the actual number of inodes after the system reboots.

POSIX is the standard for system calls, so what does POSIX say about this problem? Is there a standard way that application writers can deal with this?

Proposed POSIX Model

One proposed POSIX model to deal with this is to have in-RAM memory cache of regular files, file attributes, and directories. The units of regular files, file attributes, and directories are blocks, individual files, and directory entries respectively.

write(...) puts things in the cache, and the following system calls synchronize the content of the cache with what's actually on storage (they make changes durable):
fsync(fd) - this synchronizes data (data blocks) AND attributes (in inodes in Berkeley file system).
fdatasync(fd) - this synchronizes data ONLY. It is cheaper because it does not synchronize attributes

Remember, you must sync. Otherwise, you will lose data!

These synchronizing system calls are very slow because they make the disk arm move around. Professor Eggert edited Emacs such that it only calls sync when it is running in interactive mode to avoid redundant sync calls.

Using this POSIX model, screw-ups can still happen. We want to have atomic system calls.

Ideas for Atomic System Calls

1. COMMIT RECORDS (on a separate area of the file system)
The implementation of rename has to do several writes, and a commit record would say whether the rename happened completely or not. If the rename was incomplete, FSCK can clean up later.

We want some sort of begin and end statements such that the code surrounded by the begin and end statements is an all-or-nothing action. The all-or-nothing action would be divided into a pre-commit phase and a post-commit phase. During the pre-commit phase, we can back out of the all-or-nothing action, leaving no trace. During the post-commit phase, the action must finish. This is what we want to happen if we run rename("a", "b") with a commit record:
BEGIN
pre-commit phase (we can back out, leaving no trace)
COMMIT (assume atomic at the lower level)
post commit phase (can do clean up and optimization)
END

There are potential issues: if we instead run rename("a", "bbbbb"), we might run into trouble in the pre-commit phase. For example, there might not be enough disk space for the new name. If we cannot reach the commit point, we must back out, or abort. Therefore, we will rewrite the procedure above as:
BEGIN
pre-commit phase (we can back out, leaving no trace)
COMMIT or ABORT (assume atomic at the lower level)
post commit phase (can do clean up and optimization)
END

We can see that we used a similar technique like we did with blocking mutexes. We achieved high level atomic operations using low level atomic operations. Another idea for atomic system calls is to use journals.

2. JOURNALS

The grid below represents "cell data," where each cell is a block. Note that cell 13 contains the data "A" and cell 37 contains the data "B".

The journal can only be appended to (assume it is infinitely long), so old versions of data are not overwritten. It provides storage and random access for reads.

One way to use the journal is to:
1. Log the change you want to make
2. Log the commit
3. Write the cell data
4. Log the end

If we want to change the contents of cell 13 to "B" and the contents of cell 37 to "0" using a WRITE-AHEAD LOG, the steps would be:

BEGIN
13 = "B"
37 = "0"
COMMIT (make changes to cells)
END

The most delicate place to crash would be between COMMIT and END, but this is not an issue because if the system crashes, FSCK will see that we committed but did not finish doing the rename (end wasn't logged). Since we are using a write-ahead log, the recovery procedure replays the log from left to right, and tries to do the rename again.

One problem that could happen is a crash during FSCK. To solve this, FSCK must be written carefully, and it should recover using idempotent operations (it can restart from the beginning with the same ultimate result). OR, we can use the journal again for FSCK. If another crash happens during the FSCK, FSCK will operate on the previous FSCK (we have nested FSCKs).

Another problem with this approach is that there is a performance hit for doing all this work. A radical proposal to solve this problem is to use a log structured file system. We put the journal on the disk, but cell data is cached in RAM. The benefits of this approach is that writes are fast because no seeks are involved. However, reads can take a while if data is not cached.

We can also use a write-behind log instead of a write-ahead log. In this case, we would store the previous values of the cells, and the steps would be:
BEGIN
13 = "A"
37 = "B"
now make changes to cells
COMMIT
END

If a crash occurs, FSCK replays the log backwards (it reverts any changes). The benefits of using a write-behind log is that it is faster than write-ahead logs. However, it is more expensive to grab old data out of the cells.

Example File System/GPFS

As an example, we can look at a research file system that was developed about a year ago. The size of the system is 120PB (the equivalent of 120,000 terabytes, or 120 million GB). The system has 200,000 hard drives, each of size 600 GB, which allows for faster drives that require less power (leading to better performance).

This system is an example of a General Parallel File System (GPFS). In a GPFS, drives are scattered over the network, instead of all being connected to a single bus or handler. File metadata is also distributed throughout the system, enabling parallel access and thus better performance. Locks are distributed throughout the system too, and directories are efficiently indexed. Additionally, the system stays alive during the fsck function.

A key feature of GPFS is partition awareness. If a router fails in a network, the disk becomes partitioned, and partitions can't communicate with each other. GPFPS automatically discovers the partitions. The largest partition survives, while the smaller one reports errors.

Unreliable Processes & Bad Memory References

A common assignment statement follows: a[i] = 27;.

If the value of i is out of range, then a[i] refers to some random spot in memory. The assignment statement is possibly replacing useful information or data.

What are some potential solutions to this problem?

Hire better programmers (doesn't really work).

Make the software check
Ex. Have the compiler check the subscript range.

This slows down performance, though.

Get help from the hardware.

Ex. Add subscript instruction to hardware.

Problem: This ties up the registers, which results in worsened performance.

None of these approaches really solve the problem. An alternative approach is:

Base/Bounds Registers

General idea: put limits on RAM memory that a process is allowed to access by adding 2 registers.

Base register: holds address of 1st byte of allowed memory

Bounds register: holds address of 1st byte of forbidden memory

The hardware checks all memory accesses to ensure only the allowed memory region is being accessed. If this check fails, it should be treated like the execution of an invalid instruction. The process traps, and the kernel can decide what to do from there.

With this approach, there is no need for another instruction to deal with the subscript. However, this approach has several problems:

The amount of memory needed must be known in advanced. Thus, processes can't grow.

The RAM memory is susceptible to fragmentation.

Machine code must be position independent, so absolute addresses aren't allowed. For example, say we want to run 2 copies of sort:
sort a > aout
sort b > bout

We need a separate copy for each instance. Each copy has its own base and bounds registers.

If a program has an instruction like "jmp 10000", one of the programs is wrong, and we should trap.

Aside: sometimes, we do want shared memory.

We can extend and improve the base/bounds register idea using segments and pages.

Segments

General idea: A process has multiple base-bounds pairs, one per segment. Each process has a segment table (which requires hardware support).

If the process needs to grow, simply ask OS for another segment. Fragmentation is still a problem, since segments are of varying sizes. Thus, we might not be able to allocate a large chunk of memory, even if there is enough total free memory.

Pages

Pages are similar to segments, with the main difference being that all pages in a system are of the same size. Processes have a page table, which is indexed by page number. To access a memory address, go to the physical memory, then the fixed range, then the offset to get the word you want.

If a process needs to grow, ask OS for another page. Fragmentation is no longer an issue, since all the pages are the same size. However, there are potential vulnerability issues:

Process can "cheat" by changing value in base register. To address this problem, we can make this particular operation a privileged instruction, so only the kernel can modify the value. The kernel should split up the page table, so a process never has a page table that points to itself. The page table corresponds to other processes.

The page table is stored in register %cr3. A hacker might change %cr3 to change the page table. To address this, we should make storing values in %cr3 a privileged instruction as well.