More File System Robustness and Virtual Memory

By: Keith Corbalis, UID: 804-009-333

1. Power Loss with rename()

The main issue we are going to look at is power loss. If we write the contents of a buffer to a file, what happens if the system crashes in the middle of the write and then reboots?
write(fd, buf, bufsize);
Imagine the buffer is 10,000,000 bytes (about 10 MB). Since this is a very large write, it may be only partially on disk when the system crashes and reboots.

In the Berkeley File System, the blocks that the file system writes to aren't necessarily contiguous (a file's contents aren't always stored in left-to-right block order). Therefore, when the system reboots in the middle of a write, we cannot simply examine a contiguous run of blocks from left to right to see which blocks weren't written. We need to find another way.

We will examine this issue through a case study of the rename("a", "b") function. It happens in 3 steps:
	Step 1: Create the new file "b" in the working directory (a new directory entry pointing at inode 65).
	-----------------------------------------
	|       |"a" 27 |       |"b" 65 |       |
	-----------------------------------------

	Step 2: Point "b"'s directory entry at "a"'s inode (27), and remove the "a" entry.
	-----------------------------------------
	|       |   0   |       |"b" 27 |       |
	-----------------------------------------
	
	Step 3: Move the "b" entry to the slot where "a" was.
	-----------------------------------------
	|       |"b" 27 |       |   0   |       |
	-----------------------------------------
		
This gives us two potential problems in crashes. Step 2 makes two separate changes to the directory, and they may reach the disk in either order. If the system crashes after "a"'s entry has been removed on disk but before "b"'s entry has been repointed, our file system may look like this:
	-----------------------------------------
	|       |   0   |       |"b" 65 |       |
	-----------------------------------------
		
Which is equivalent to calling unlink("a"). This isn't what we want: we have lost the inode number (27) that "a"'s contents correspond to, so we can no longer point "b" at them and complete the rename.

The other possibility, if "b"'s entry is repointed at inode 27 before the removal of "a" reaches disk, is this:
	-----------------------------------------
	|       |"a" 27 |       |"b" 27 |       |
	-----------------------------------------
		
Which is equivalent to calling unlink("b") followed by link("a", "b"). Now the link count is wrong: inode 27 records one link, but two directory entries ("a" and "b") refer to it, so we have one more file name than we wanted. This is where fsck will come into play.
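
To make the three steps concrete, here is a minimal sketch in C that manipulates a toy in-memory directory (an array of name/inode slots). The struct and function names are made up for illustration, not taken from any real file system; the point is that step 2 consists of two separate stores, and a crash between them leaves exactly the inconsistent states shown above.

    /* Toy model of the rename("a", "b") steps above.  All names here
     * (struct dirslot, do_rename, ...) are hypothetical. */
    #include <stdio.h>
    #include <string.h>

    struct dirslot {
        char name[8];     /* "" means the slot is free          */
        int  ino;         /* inode number the entry points at   */
    };

    static struct dirslot dir[5] = {
        { "", 0 }, { "a", 27 }, { "", 0 }, { "", 0 }, { "", 0 },
    };

    static void do_rename(void) {
        /* Step 1: create the new entry "b" (with a fresh inode, 65 here). */
        strcpy(dir[3].name, "b");
        dir[3].ino = 65;

        /* Step 2: two separate updates.  On a real disk they may land in
         * either order; a crash in between leaves either the unlink("a")
         * state or the double-link state ("a" and "b" both at inode 27). */
        dir[3].ino = 27;              /* point "b" at "a"'s inode */
        dir[1].name[0] = '\0';        /* remove the "a" entry     */
        dir[1].ino = 0;

        /* Step 3: move the "b" entry back to the slot "a" occupied. */
        dir[1] = dir[3];
        memset(&dir[3], 0, sizeof dir[3]);
    }

    int main(void) {
        do_rename();
        for (int i = 0; i < 5; i++)
            printf("| %-3s %2d ", dir[i].name, dir[i].ino);
        printf("|\n");
        return 0;
    }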

2. fsck() and fsync()

In our proposed POSIX model of a file system, some amount of RAM is used as a cache. Three kinds of entities are stored in this cache:
  1. Regular files: cached as blocks of file data.
  2. File attributes: cached per individual file (the metadata kept in inodes).
  3. Directories: cached in units of directory entries.
    • When you create a file, a directory entry is created for it.
fsck stands for "file system check": it checks and repairs the file system, looking for inconsistencies left behind by a crash, and the file system is blocked while it runs. Instances in which fsck is run are given toward the end of section 3.

When one calls the write function, the system writes to the cache, rather than the disk. In order to write the items in cache to disk, the following two functions are used:
fsync(int fd)
fdatasync(int fd)
fsync does not return until the cached contents of the file referred to by fd are permanently stored on disk. This includes both the file's data and the file's attributes.

fdatasync is very similar to fsync, but cheaper: it synchronizes only the file's data, not its attributes, so there is less to write and it completes sooner.

The attributes are kept in the inodes of the file system, while the data is kept in the data blocks. So fsync flushes both the inode and the data blocks, while fdatasync flushes only the data blocks.
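
As a concrete illustration, here is a minimal sketch of how an application would use these calls to make a write durable. The file name is made up and error handling is abbreviated; fdatasync(fd) could be substituted for fsync(fd) when only the data needs to be durable.

    /* Write a buffer and force it to disk before reporting success.
     * "important.dat" is just an example name. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        const char *msg = "data that must survive a crash\n";
        int fd = open("important.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, msg, strlen(msg)) != (ssize_t)strlen(msg)) {
            perror("write");          /* data is only in the cache so far   */
            return 1;
        }
        if (fsync(fd) != 0) {         /* flush data AND attributes to disk  */
            perror("fsync");          /* fdatasync(fd) would flush data only */
            return 1;
        }
        if (close(fd) != 0) { perror("close"); return 1; }
        return 0;
    }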

In the Berkeley File System, the inodes are packed together into blocks on disk. In the POSIX model, attributes belong to each individual file. An ideal model would standardize how all file systems work, so that we could fsync a directory when we want a newly created file to survive a crash. Because file systems are not standardized in this respect, fsync() and fdatasync() end up being fairly slow.

Moving back to our original issue, we have a rename() that can turn into an unlink, or an unlink followed by a link, if the system crashes partway through. What we want is an atomic rename, so that these errors do not occur when the system crashes due to a power failure.

3. An Atomic Rename

There are a few ideas for making rename atomic.
  1. Commit record -- a separate record that marks whether an atomic action took effect or not
    • The idea is that a rename has to perform several writes.
    • The commit record says whether those writes took effect or not
    	rename("a","b")
    	BEGIN
    		pre-commit phase (we can back out of this, leaving no trace of the writes) [invisible]
    	COMMIT (atomic) or ABORT (atomic) <---- low level atomic write of one block
    		post-commit phase (cleanup operations, optimizations, etc.; e.g. update the hash table mapping directories to inodes) [invisible]
    	END
    						
  2. Journaling -- append-only file system
    • "cell data" -- each cell is a block
    • The simplest journaling model: infinitely long append-only storage (for writes), random access (for reads)
    • An example of journaling:
    	BEGIN -- log BEGIN in journal
    		log the changes you want to make
    	13 = "B"
    	37 = "0"
    	COMMIT -- atomically record that the logged changes are to take effect
    	WRITE -- install the logged changes into the cell data on disk
    	END -- log END in journal
    						
    • Journaling is exactly what its name suggests: writing a journal entry for every step taken during the write process. This way, when the system crashes and reboots, it can see where it left off; the recovery procedure reads the journal from left to right. But what if we combined these two ideas?

  3. Hybrid system -- a log-structured file system, with commit records
    • The idea is to keep the journal on disk while keeping the cell data in a cache in RAM. Because we aren't actually writing the cell data to the disk, writes are very fast: there's no seeking! However, reads can take a while when the data is not in the cache, since it has to be found in the log. (A small sketch combining the journal and the commit record appears after this list.)
    • There are two types of logs:
      • Write-behind log: logs the old contents of the cells about to be overwritten; when fsck is run after a crash, it replays the log backwards, undoing changes that were never committed.
      • Write-ahead log: logs the new contents about to be written; when fsck is run, it replays the log forwards, redoing changes that were committed.
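
Here is a minimal sketch of these ideas combined: a write-ahead journal with a commit record, operating on a toy "disk" that is just an array of cells in memory. All of the names (journal_entry, atomic_update, recover) are made up for illustration; a real file system would also have to force each log record to disk before proceeding.

    #include <stdio.h>
    #include <string.h>

    #define NCELLS 64

    static char disk[NCELLS];           /* cell data (what readers see) */

    enum op { BEGIN, CHANGE, COMMIT, END };

    struct journal_entry {              /* one record in the append-only log  */
        enum op op;
        int cell;                       /* which cell to change (CHANGE only) */
        char value;                     /* new value            (CHANGE only) */
    };

    static struct journal_entry journal[256];
    static int jlen;

    static void log_entry(enum op op, int cell, char value) {
        journal[jlen].op = op;          /* in a real system this record must  */
        journal[jlen].cell = cell;      /* reach disk before we move on       */
        journal[jlen].value = value;
        jlen++;
    }

    /* rename-like update: log the intent, commit, then install the changes */
    static void atomic_update(int cell1, char v1, int cell2, char v2) {
        log_entry(BEGIN, 0, 0);
        log_entry(CHANGE, cell1, v1);   /* pre-commit: only the log changes   */
        log_entry(CHANGE, cell2, v2);
        log_entry(COMMIT, 0, 0);        /* the single atomic "it happened"    */
        disk[cell1] = v1;               /* post-commit: install the cell data */
        disk[cell2] = v2;
        log_entry(END, 0, 0);
    }

    /* recovery (what an fsck-style replay would do): if a COMMIT is present,
     * replay the logged changes forward; otherwise ignore them (aborted). */
    static void recover(void) {
        int committed = 0;
        for (int i = 0; i < jlen; i++)
            if (journal[i].op == COMMIT)
                committed = 1;
        if (!committed)
            return;                     /* crash before commit: no trace left */
        for (int i = 0; i < jlen; i++)
            if (journal[i].op == CHANGE)
                disk[journal[i].cell] = journal[i].value;  /* replay forward */
    }

    int main(void) {
        memset(disk, '.', sizeof disk);
        disk[13] = 'A';
        atomic_update(13, 'B', 37, '0');   /* mirrors the 13 = "B", 37 = "0" example */
        recover();                         /* harmless if no crash occurred   */
        printf("cell 13 = %c, cell 37 = %c\n", disk[13], disk[37]);
        return 0;
    }
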
An example of a research file system: GPFS (General Parallel File System)
	120 petabytes across 200,000 hard drives (600 GB each -- using many smaller drives gives faster drives that require less power per drive)
	Has the following properties:
		Distributed metadata (results in good performance, unlike HDFS [Hadoop Distributed File System])
		Efficient directory indexing
		Distributed locking
		Partition awareness
		The system stays live during fsck
		It consists of 2 partitions: one stays live while the other is checked and reports errors
		

4. Unreliable Processes with Bad Memory References

Our next problem we are going to look at, besides power failure, are unreliable processes that have bad memory references. An example of this is accessing an index in an array that doesn't exist. For example, if we have an array of size 5, the only accessible indexes are array[0], array[1], array[2], array[3], and array[4]. If we try to access array[5], our program will likely fail with an error of "Segmentation fault", or "SIGSEGV".

Some solutions:
  • Hire better programmers.
    • Not a great solution: we all make errors, though some programmers are better than others.
  • Tell the compiler to insert subscript checks, e.g. Java Virtual Machine.
    • This hurts performance, since every array access now requires an extra check.
  • Get help from hardware.
    • Add a SUBSCRIPT option to the hardware, but even this is too slow.
One idea for catching bad memory references is to have a base and a bounds register in the hardware:
       v out of bounds                          v out of bounds
    ------------------------------------------------------------
    \\\\\\\\|                                      |\\\\\\\\\\\\
    ------------------------------------------------------------
            ^ base                                 ^ bounds
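
A minimal sketch of the check such hardware would perform on every access, with hypothetical base and bounds values; a real CPU would trap to the operating system rather than return a boolean.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct cpu_regs {          /* hypothetical per-process registers      */
        uintptr_t base;        /* first address the process may use       */
        uintptr_t bounds;      /* first address past its region           */
    };

    /* true if the access is allowed; hardware would fault instead */
    static bool access_ok(const struct cpu_regs *r, uintptr_t addr, size_t len) {
        return addr >= r->base && addr + len <= r->bounds;
    }

    int main(void) {
        struct cpu_regs r = { .base = 0x100000, .bounds = 0x200000 };
        printf("%d\n", access_ok(&r, 0x150000, 4));   /* 1: inside the region */
        printf("%d\n", access_ok(&r, 0x200000, 4));   /* 0: at/after bounds   */
        return 0;
    }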
		
Downsides:
  • You must know in advance how much memory you need, so you can't grow a process. You can shrink one by decrementing the bounds register, but there is no way for the user to know whether the memory just before the base or just after the bounds is currently being used by another process or by the operating system.
  • You get fragmentation of memory this way.
  • Machine code can't contain absolute addresses if a process is to be moved or grown; it must be position-independent code (PIC).
  • Sometimes we want shared memory, which doesn't fit this model: with a single base-bounds pair per process, there is no way to allow access to a region shared with another process.
Segments:
  • Multiple base-bound pairs, 1/segment
  • An address consists of an 8-bit segment number and a 24-bit offset within the segment
  • This requires hardware support
  • Each process has its own segment table
  • Segments have varying sizes
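
A minimal sketch of how such a segmented address could be translated and checked, assuming a 32-bit virtual address split into an 8-bit segment number and a 24-bit offset; the segment table and its fields are hypothetical.

    /* Translate a segmented virtual address (8-bit segment #, 24-bit offset)
     * using a per-process segment table.  Names are made up for illustration. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct segment {
        uint32_t base;      /* physical address where the segment starts */
        uint32_t size;      /* length of the segment in bytes (varies)   */
        bool     valid;     /* is this segment in use?                   */
    };

    static struct segment seg_table[256];   /* one entry per 8-bit segment number */

    /* Returns the physical address, or -1 to stand in for a hardware fault. */
    static int64_t translate(uint32_t vaddr) {
        uint32_t segno  = vaddr >> 24;          /* top 8 bits  */
        uint32_t offset = vaddr & 0x00FFFFFF;   /* low 24 bits */
        struct segment *s = &seg_table[segno];
        if (!s->valid || offset >= s->size)
            return -1;                          /* out of bounds: trap */
        return (int64_t)s->base + offset;
    }

    int main(void) {
        seg_table[1] = (struct segment){ .base = 0x40000, .size = 0x1000, .valid = true };
        printf("%lld\n", (long long)translate(0x01000ABCu)); /* segment 1, offset 0xABC  */
        printf("%lld\n", (long long)translate(0x01002000u)); /* offset past segment size */
        return 0;
    }
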
We will pick up with segments next lecture.