CS 111
Scribe Notes for 5/21/13
by Kevin Takata and Ashish Lal
File System Robustness (continued)
Unaddressed Issues with File System Robustness
Large Writes
write (fd, buf, bufsize)
where bufsize = 10000000 (very large)
For large writes, the kernel has to write block by block. Thus, large writes can have undefined behavior if a crash and reboot occurs somewhere in the middle of the write.
For the berkeley file system, blocks are scattered in random order; therefore, for efficiency, the kernel will do writes out of order in order to exploit locality of reference of the disk arm.
In the case above, if the disk needle is pointed to the block allocated where EFGH resides, it would write that to disk first followed by ABCD followed by IJKL. However, a crash and reboot in the middle of this write would cause undefined behavior where it would seem that some parts of the write worked and some parts didn't.
Directories
Similarly, a crash in the middle of a directory rename could result in two different cases.
Before:
After (I):
When a crash occurs between writing α and writing β, you will be left with the old copy of "b" and you will have lost the old "a" resulting in:
≡unlink("a")
In the other potential case,
After (II):
When a crash occurs between writing α and writing β, you will be left with the new copy of "b" and you will have lost the old "b" resulting in:
In the two cases above, the first one is disastrous because the file is lost. However, in the second case, although there will be leaked memory, it is not disastrous because it can be cleaned up when the system has been rebooted and fsck is run.
Proposed POSIX Model
Types of files |
Units |
Regular files |
blocks of a given size statvfs outputs |
File attributes |
Individual files |
Directories |
Directory entries |
System Calls:
-
fsync(fd)
which synchronizes data (data blocks) +atrributes (inode); Synchronizes the contents of a cache with what's on permanent storage (disk); Expensive, slower
-
fdatasync(fd)
which just synchronizes data (data blocks); Cheaper and faster
When writing using the write(...) function, there is no guarantee that it will be written to disk. It's written to cache first. In order to guarantee the the data to be written to disk, you must use
fsync(fd)
or
fdatasync(fd)
. However, fsync is
Solutions to Make Atomic System Calls
Idea #1: Commit Record
- Separate spot on the file system
- Rename consists of multiple writes; we either want all of these writes to happen or none of them
- Commit records will record the progress of which writes are completed
- On reboot, the system will check the commit record and see if anything else needs finishing
- Implementation of commit records for rename:
rename (a,bbbbbbb)
BEGIN
pre-commit phase (can back out, leaving no trace)
COMMIT(atomic) OR ABORT(atomic) //Low level atomic write of one block; ABORT: Because something in pre-commit phase like no disk space, too much in directory, etc
post-commit phase //Cleanup, optimization
END
Idea #2: Journal
- Attributes of the simplest model:
- Infinitely long
- Append only storage (for writes)
- Random access (for reads)
- Write-ahead journal process
Radical Proposal: (Log Structured File System)
- Keep entire journal on disk
- Put cell (block) data in RAM
- Pro: Writes are fast because there's no seeking
- Con: Reads can take a while (if data desired is out of cache)
- Another way to use journal: Write-behind journal process
Example Research File System:
- Reasoning for parallel file systems: in the case of 120 PB storage array with 200,000 hard-drives (each containing 600 GB), we want a file system that exploits parallel reading and writing from the storage array
- GPFS (General Parallel File System) -- A glimpse of how future file systems will probably work
- Distributed metadata (good performance unlike HDFS)
- Efficient directory indexing
- Distributed locking
- Partition Awareness
- System stays live during FSCK
Virtual Memory
Problem: Unreliable processes that have bad memory references
Example:
a[i] = 27; //but i is out of range
Solutions:
- Hire better programmers (doesn't work we're human)
- Tell compiler to insert subscript checks (ex. Java Virtual Machine)
- However, this has performance problems because the system has to check each time you dereference an array that it's within its bounds
- Get help from hardware
- Add SUBSCRIPT instruction to hardware (Performance problem: This instruction would require occupying registers that could be used by other processes)
- Add base and bounds register to hardware
Base and Bounds
The hardware checks that all memory access should fit between the base and the bound. If the check fails, the hardware traps. The kernel will then take control over the unruly process and deal with it.
Problems:
Segments
- Multiple base/bound pairs where one pair is allotted for each segment
- Sections of discontiguous memory used for a single process. Although they deal with fragmentation better than base/bound pairs alone, the problem still exists due to the varying sizes of segments.
- Provide flexibility in allowing a process to grow or shrink the amount of memory allocated to it.
- Solves all of the issues listed above with base/bound pairs other than the potential fragmentation issue with varying sizes of segments.
- Requires hardware support
- Each process has its own segment table in order to reference the addresses within each segment
Segment Number |
Offset in segment |
- In order to a reference an address in a process using segments, 32-bit integer will have the upper 8 bits be used to reference the number of the segment and the lower 24 bits will be the offset inside the segment.
Pages
- In order to address the issue of fragmentation with segments, pages must be utilized.
- Have all the same properties of segments with the exception that they are all the same size; usually, 4 KiB.
Page Number |
Offset in page |
Because we know that each page is only 4 KiB big, we only need 12 bits to reference all of the memory in that page
- In order to grow the domain of a process, you simply allocate and add pages to that process's page table.
- Page tables: Provide information about all of the pages that a process contains
Page number |
Starting Address |
0 |
0x28493934 |
1 |
0x9a8b3f7b |
... |
... |
n-1 |
0x23674356 |
- A page table can never have a starting address that points at itself in order to avoid exploitation
- Changing anything in a page table requires a privileged instruction