CS 111

Scribe Notes for 5/21/13

by Kevin Takata and Ashish Lal

File System Robustness (continued)

Unaddressed Issues with File System Robustness

Large Writes

write (fd, buf, bufsize) where bufsize = 10000000 (very large)

For large writes, the kernel has to write block by block. Thus, large writes can have undefined behavior if a crash and reboot occurs somewhere in the middle of the write. For the berkeley file system, blocks are scattered in random order; therefore, for efficiency, the kernel will do writes out of order in order to exploit locality of reference of the disk arm.

EFGH

ABCD

IJKL

In the case above, if the disk needle is pointed to the block allocated where EFGH resides, it would write that to disk first followed by ABCD followed by IJKL. However, a crash and reboot in the middle of this write would cause undefined behavior where it would seem that some parts of the write worked and some parts didn't.

Directories

Similarly, a crash in the middle of a directory rename could result in two different cases.

Before:

"a" 27

"b" 65

After (I):

"b" 27

When a crash occurs between writing α and writing β, you will be left with the old copy of "b" and you will have lost the old "a" resulting in:

"b" 65

≡unlink("a")

In the other potential case, After (II):

"b" 27

When a crash occurs between writing α and writing β, you will be left with the new copy of "b" and you will have lost the old "b" resulting in:

"a" 27

"b" 27

In the two cases above, the first one is disastrous because the file is lost. However, in the second case, although there will be leaked memory, it is not disastrous because it can be cleaned up when the system has been rebooted and fsck is run.

Proposed POSIX Model

Types of files	Units
Regular files	blocks of a given size statvfs outputs
File attributes	Individual files
Directories	Directory entries

System Calls:

fsync(fd) which synchronizes data (data blocks) +atrributes (inode); Synchronizes the contents of a cache with what's on permanent storage (disk); Expensive, slower
fdatasync(fd) which just synchronizes data (data blocks); Cheaper and faster

When writing using the write(...) function, there is no guarantee that it will be written to disk. It's written to cache first. In order to guarantee the the data to be written to disk, you must use fsync(fd) or fdatasync(fd). However, fsync is

Solutions to Make Atomic System Calls

Idea #1: Commit Record

Separate spot on the file system
Rename consists of multiple writes; we either want all of these writes to happen or none of them
Commit records will record the progress of which writes are completed
On reboot, the system will check the commit record and see if anything else needs finishing

Implementation of commit records for rename:


      rename (a,bbbbbbb)
        BEGIN
          pre-commit phase (can back out, leaving no trace)
        COMMIT(atomic) OR ABORT(atomic)   //Low level atomic write of one block; ABORT: Because something in pre-commit phase like no disk space, too much in directory, etc
          post-commit phase               //Cleanup, optimization
        END

Idea #2: Journal

Attributes of the simplest model:
- Infinitely long
- Append only storage (for writes)
- Random access (for reads)
Write-ahead journal process
- Journal process
  1. Log the change you want to make
  2. Log COMMIT
  3. Do writes for system call
  4. Log END
```
            BEGIN               //Step 1: Log the change
            13 = "B"
            37 = "0"
            COMMIT              //Step 2: Log commit
            ... (make-changes)  //Step 3: Write data to disk (system call data)
            END                 //Step 4: Log END
          
```
- Recovery Procedure (FSCK): Replay log left to right (replays forward)
  - If a crash occurs during FSCK, then FSCK should recover using idempotent operations (where if you run an operation twice or more, then it'll have the same result)

Radical Proposal: (Log Structured File System)

Keep entire journal on disk
Put cell (block) data in RAM
- Pro: Writes are fast because there's no seeking
- Con: Reads can take a while (if data desired is out of cache)

Another way to use journal: Write-behind journal process

Journal process

Log the change you want to make
Do writes for system call
Log COMMIT
Log END


            BEGIN               //Step 1: Log the change
            13: "B"
            37: "0"
            ... (make-changes)  //Step 2: Write data to disk (system call data)
            COMMIT              //Step 3: Log commit
            END                 //Step 4: Log END

Recovery Procedure (FSCK): Replay log right to left (replays backward)

Example Research File System:

Reasoning for parallel file systems: in the case of 120 PB storage array with 200,000 hard-drives (each containing 600 GB), we want a file system that exploits parallel reading and writing from the storage array
GPFS (General Parallel File System) -- A glimpse of how future file systems will probably work
- Distributed metadata (good performance unlike HDFS)
- Efficient directory indexing
- Distributed locking
- Partition Awareness
- System stays live during FSCK

Virtual Memory

Problem: Unreliable processes that have bad memory references
Example: a[i] = 27; //but i is out of range

Solutions:

Hire better programmers (doesn't work we're human)
Tell compiler to insert subscript checks (ex. Java Virtual Machine)
- However, this has performance problems because the system has to check each time you dereference an array that it's within its bounds
Get help from hardware
- Add SUBSCRIPT instruction to hardware (Performance problem: This instruction would require occupying registers that could be used by other processes)
- Add base and bounds register to hardware

Base and Bounds

//////////

base

bounds

The hardware checks that all memory access should fit between the base and the bound. If the check fails, the hardware traps. The kernel will then take control over the unruly process and deal with it. Problems:

You must know in advance how much memory you need (you can't grow a process, can only shrink one)
Fragmentation
No absolute addresses in machine code; must use position independent code (PIC)
```
      sort a > aout
      sort b > bout
    
```
10000 ↓

////////// Copy of 'sort' ////////// Copy of 'sort' //////////

"sort a" "sort b"
If both copies of sort have an instruction "jmp 10000", only the one on the left will be valid because the one on the right will be referencing something outside its base/bound pairs and will cause a hardware trap.
Sometimes we want shared memory

One possible implementation of base/bound pairs with shared memory however it may be difficult to implement and manage.

//////////////////////////////////////////

base 1 base 2
shared
bound 1 bound 2

Segments

Multiple base/bound pairs where one pair is allotted for each segment
Sections of discontiguous memory used for a single process. Although they deal with fragmentation better than base/bound pairs alone, the problem still exists due to the varying sizes of segments.
Provide flexibility in allowing a process to grow or shrink the amount of memory allocated to it.
Solves all of the issues listed above with base/bound pairs other than the potential fragmentation issue with varying sizes of segments.
Requires hardware support
Each process has its own segment table in order to reference the addresses within each segment

← 8 → ← 24 →

Segment Number Offset in segment
In order to a reference an address in a process using segments, 32-bit integer will have the upper 8 bits be used to reference the number of the segment and the lower 24 bits will be the offset inside the segment.

Pages

In order to address the issue of fragmentation with segments, pages must be utilized.
Have all the same properties of segments with the exception that they are all the same size; usually, 4 KiB.

← 20 → ← 12 →

Page Number Offset in page

Because we know that each page is only 4 KiB big, we only need 12 bits to reference all of the memory in that page
In order to grow the domain of a process, you simply allocate and add pages to that process's page table.

Page tables: Provide information about all of the pages that a process contains

Page number	Starting Address
0	0x28493934
1	0x9a8b3f7b
...	...
n-1	0x23674356

A page table can never have a starting address that points at itself in order to avoid exploitation
Changing anything in a page table requires a privileged instruction