CS 111 Winter 2016

Lecture 13

Scribe Notes for 2/24/16

By Aaron Chung & Nikhil Shridhar


What's the Worst That Can Happen in a Filesystem?


Bug Report for GZIP

gzip: software application used for compression and decompression of data files

A pretty simple program that has been maintained for decades.
input: $ gzip foo
output: foo.gz (and the original foo is removed)
Now let's take a look at what system calls gzip uses:


    if (close(ofd) != 0)     // ofd: file descriptor for the output file, foo.gz
        error();
    if (close(ifd) != 0)     // ifd: file descriptor for the input file, foo
        error();
    if (unlink("foo") != 0)  // remove the original (but has foo.gz's data actually reached the disk yet?)
        error();

    //bug report says that if the system crashes here, gzip could lose all of your data

    exit(0);
    
If you look at all the system calls, we were very careful, making sure to close the output file. So how is it that this program can lose data?

Why?

This bug is so common that if you look at a random application running on Linux, there is a good chance it has the same bug, just waiting to lose your data.

The problem is that when the system processes these write requests, it may reorder them for efficiency. This reordering creates the possibility of storage leaks and data loss. In the following sections we describe why these problems appear.

Solution:

    if (fsync(ofd) != 0)
        error();

This call goes before close(ofd) and unlink("foo"). fsync blocks until the output file's data has actually reached the disk, so the original request ordering is preserved: foo cannot be removed before foo.gz is safely stored. While this solves the data loss issue, it slows performance significantly.
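
A minimal sketch of the corrected sequence, assuming (as above) that ofd is the descriptor for foo.gz and ifd is the descriptor for foo:

    if (fsync(ofd) != 0)        // wait until foo.gz's data is actually on disk
        error();
    if (close(ofd) != 0)
        error();
    if (close(ifd) != 0)
        error();
    if (unlink("foo") != 0)     // only now is it safe to remove the original
        error();

    exit(0);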






BSD Filesystem Layout Diagram

Boot Sector

This region contains machine code to be loaded into RAM by the computer's built-in firmware

Superblock

This region contains a record of the characteristics of a filesystem, including its size, the block size, the empty and filled blocks and their respective counts, the size and location of the inode tables, the disk block map and usage information, and the size of the block groups

Free Block Bitmap

This region keeps track of where free blocks are in the data blocks section

Inode Table

Each entry in the Inode Table corresponds to one file in the file system

Data Blocks

This region stores the data in each file



Overview of Performance Issues With File Systems

Suppose you wanted to get a block of data from a hard disk

There is an overhead involved with reading blocks from disk:

Overhead = seek time + rotational latency + transfer from disk to cache + transfer cache to RAM = ~18ms

18ms = ~18 million instructions

This means that in the same amount of time that it would take you to get one block from disk, you could execute roughly 18 million instructions of useful work. You would probably prefer to do the 18 million instructions.
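
As a sanity check on that claim, here is a back-of-the-envelope sketch in C. The ~18 ms of overhead comes from above; the figure of roughly 10^9 instructions per second is an illustrative assumption, not from the lecture:

    #include <stdio.h>

    int main(void) {
        double overhead_seconds = 0.018;        // ~18 ms of seek + rotation + transfer
        double instructions_per_second = 1e9;   // assumed: ~1 GHz, 1 instruction per cycle
        // Prints 18000000: the CPU work forgone while waiting for one block.
        printf("%.0f instructions\n", overhead_seconds * instructions_per_second);
        return 0;
    }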


Performance Metrics for I/O

Utilization

     Fraction of capacity of disk drive doing useful work

Throughput

     The rate of request completion

Latency

     Overall delay between request and response

These are known as competing objectives: techniques that improve throughput and utilization (such as batching requests) typically increase latency, while keeping latency low typically reduces throughput and utilization.



How to Improve File System Performance

Batching

If you have a bunch of individual writes to do, don’t do them one at a time. Instead collect them all together into one big write and then do that big write all at once.

This is done at all levels of file systems in the OS.

A classic example of batching is getchar()/putchar():

    int c;
    while ((c = getchar()) != EOF) {
        putchar(c);
    }

If every getchar() and putchar() turned into its own one-byte read() or write() system call, this code would be really inefficient.

Instead, the stdio library batches these calls behind the scenes, turning getchar and putchar into calls like read(0, buf, 4096) and write(1, buf, amount_to_write).
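
As a rough sketch of what the library is doing under the hood (error handling and partial writes omitted), the batched version looks something like this:

    #include <unistd.h>

    int main(void) {
        char buf[4096];
        ssize_t n;
        // One system call moves up to 4096 bytes instead of one byte at a time.
        while ((n = read(0, buf, sizeof buf)) > 0)
            write(1, buf, n);
        return 0;
    }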


Speculation

File system will guess what your program will do in the near future and act on this guess

For example, when the system turns getchar() into a 4096-byte read, it is guessing that you will want to keep reading. The system guesses what you will want to do and then actually does some of the work before you explicitly ask for it.
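
The kernel normally does this kind of read-ahead on its own, but an application can also encourage it explicitly. Here is a small sketch using posix_fadvise to hint that a file will be read sequentially (the file name "bigfile" is just an example):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("bigfile", O_RDONLY);
        if (fd < 0)
            return 1;
        // Hint: we plan to read sequentially, so prefetching is worthwhile.
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        /* ... read the file ... */
        close(fd);
        return 0;
    }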

Why does this work? Because most programs exhibit locality of reference: the data you touch next is usually near the data you just touched, so the guess is usually right.


We can think of our disk access code as a system of layers

Block Layer

Block Layer Diagram

The block layer knows nothing about the bitmap and inodes. All it knows is the size of the device (how many blocks it holds). It communicates with the rest of the file system using only requests (read and write).

The block layer maintains a cache. If higher levels of the system exhibit locality of reference, a request to read a given block can be very fast.
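
A minimal sketch of such a cache is below. It is a direct-mapped cache over block numbers; disk_read is a stub standing in for the real (slow) device driver, and the sizes are made up for illustration:

    #include <string.h>

    #define BLOCK_SIZE 8192
    #define CACHE_SLOTS 64

    struct cache_slot {
        int valid;
        long blockno;
        char data[BLOCK_SIZE];
    };

    static struct cache_slot cache[CACHE_SLOTS];

    // Stub: the real version would issue the ~18 ms disk request.
    static void disk_read(long blockno, char *buf) {
        memset(buf, (int)(blockno & 0xff), BLOCK_SIZE);
    }

    // Read one block, satisfying the request from the cache when possible.
    void read_block(long blockno, char *buf) {
        struct cache_slot *slot = &cache[blockno % CACHE_SLOTS];
        if (!slot->valid || slot->blockno != blockno) {
            disk_read(blockno, slot->data);       // miss: go to disk
            slot->blockno = blockno;
            slot->valid = 1;
        }
        memcpy(buf, slot->data, BLOCK_SIZE);      // hit (or freshly filled)
    }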


How can we make file system performance even faster (specifically with regard to the block layer code)?


Simple Model of Hard Disk Access

Hard Disk Model

Model the disk as a large array of blocks

At any point in time, the IO head is pointing to some block on disk

To go from the current block h to some other block i, the cost is proportional to the distance |h-i|

Since seek time dominates the overhead time of reading blocks, this model is useful

INTERESTING QUESTION: What is the average cost of seeking a random block to read next?

We can calculate the average cost of seeking from the current head position h to the next requested block i, treating both as uniformly distributed over [0, 1]:

    Average cost = ∫₀¹ ∫₀¹ |h - i| di dh = 1/3

So on average, a random seek crosses about one third of the disk.
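
A quick way to convince yourself of the 1/3 figure is to estimate it numerically; this small sketch just averages |h - i| over random pairs:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    int main(void) {
        const int trials = 1000000;
        double total = 0;
        for (int t = 0; t < trials; t++) {
            double h = (double)rand() / RAND_MAX;   // current head position
            double i = (double)rand() / RAND_MAX;   // next requested block
            total += fabs(h - i);
        }
        printf("average seek distance: %f\n", total / trials);  // ~0.333
        return 0;
    }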



"Queue" of pending requests:

Requests are stored for a period of time until the block layer decides which to do next. The order doesn't necessarily have to be the order in which the requests appeared.

There are a few different algorithms which are used to choose the next request to execute:



Algorithms for Determining Which Request to Satisfy Next

Shortest Seek Time First (SSTF)

If you have a lot of work to do, pick the request that is closest to the current position. This is a greedy algorithm. It does not always guarantee the minimal cost solution. In addition, it is unfair in the sense that it can lead to starvation. If new requests continue to appear that are closer to the current position than some older request, the older request will never get executed.
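
A sketch of the selection step, assuming pending requests are represented simply as block numbers in an array (the real block layer keeps a richer queue):

    #include <stdlib.h>

    // Return the index of the pending request closest to the current head
    // position, or -1 if there are no pending requests.
    int sstf_pick(const long *pending, int n, long head) {
        int best = -1;
        long best_dist = 0;
        for (int i = 0; i < n; i++) {
            long dist = labs(pending[i] - head);
            if (best == -1 || dist < best_dist) {
                best = i;
                best_dist = dist;
            }
        }
        return best;
    }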

There are a few modifications which will take care of the issue of starvation.

Elevator Algorithm

This algorithm services requests with the shortest seek time in the positive direction until you either reach the end or complete all pending requests, then reverses and sweeps back the other way. This algorithm does not cause starvation. However, it is not entirely fair because it doesn't satisfy requests in the order of issuance. Also, requests at either end of the disk will have longer average wait times than requests in the middle of the disk.

Circular Elevator Algorithm

This algorithm is essentially the same as the original elevator algorithm except that you ALWAYS seek in the positive direction. Once the end is reached, simply return the IO head to the beginning of the disk. While this is added work for the disk head, it makes the overall algorithm more fair in that all requests will have the same average wait time, regardless of their position on the disk.

SSTF + FCFS

This algorithm gives each request a score based on how long it has been waiting and the request's distance from the IO head. This is essentially a greedy algorithm without starvation. No matter how far away a request is from the IO head, eventually it will have waited long enough to score first in the wait queue.
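
The lecture did not pin down an exact formula, but one plausible scoring rule looks like the sketch below; ALPHA (how much a unit of waiting time offsets a block of seek distance) is an assumed tuning knob:

    #include <stdlib.h>

    #define ALPHA 0.5   // assumed weight: waiting time vs. seek distance

    struct request {
        long blockno;
        long wait_ticks;   // how long this request has been queued
    };

    // Pick the request with the lowest score = distance - ALPHA * wait_ticks.
    // A request that has waited long enough wins no matter how far away it is.
    int sstf_fcfs_pick(const struct request *q, int n, long head) {
        int best = -1;
        double best_score = 0;
        for (int i = 0; i < n; i++) {
            double score = labs(q[i].blockno - head) - ALPHA * q[i].wait_ticks;
            if (best == -1 || score < best_score) {
                best = i;
                best_score = score;
            }
        }
        return best;
    }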



What Can Go Wrong When Renaming a File?

Example: rename("foo", "fuzzy")

Look in the working directory's inode to find where the directory's data blocks are on disk. Those blocks contain the directory entry for foo: its name, the name's length, and its inode number. We want to overwrite that entry so the file name reads "fuzzy" instead of "foo" and the file name length is 5 instead of 3. This write request will be sitting as a pending request in the wait queue.

What Can Go Wrong Here?

Suppose the directory grows. What would we need to do?

  1. Find a free block to allocate by looking at the free block bitmap and mark it as used. (The block bitmap is cached in RAM)
  2. Since the directory has grown, the directory data now has 2 blocks in it instead of just one. We need to write out the appropriate part in the inode table for the working directory. (This is probably also cached in RAM)
  3. Mark the old directory entry block as free since we are no longer using it.

So in what should just be a simple rename operation, instead of just rewriting a block in place, we might have to write several different blocks just to make it happen.

Now let's consider what order we should do those writes in.

Why does order matter?

Order to Write

  1. Write the new file name and file name length to the free block found in the free block bitmap. (If the system crashes here, the user would not be aware of what happened and nothing will be affected.)
  2. Write to the free block bitmap, indicating that the block just written to has been taken. (If the system crashes here, this will cause a storage leak.)
  3. Overwrite the inode block. (If the system crashes here, we will have 2 different names for the same file, essentially a hard link. In this case, the user has not lost any data. However, we haven't updated the link count, so if we happen to remove one of the names, we would be left with a dangling pointer: a directory entry pointing to a file that no longer exists.)
  4. Overwrite the old directory block. (If the system crashes here, it doesn't matter; the rename request has been completely fulfilled already.)


File System Invariants

  1. Every Block is used for exactly one purpose
  2. All referenced blocks are appropriately initialized. (Free blocks don't have to be initialized because they are not referenced)
  3. All referenced data blocks are marked as used in the bitmap
  4. All unreferenced data blocks are marked free in bitmap

Penalties for Violating These Invariants

  1. Suppose you use the same block for 2 different purposes. When you overwrite one block, you also overwrite the other. This can cause data loss.
  2. Suppose you reference an improperly initialized block. This might cause you to look at files you aren't supposed to.
  3. Suppose the referenced data blocks aren't marked as used in the free block bitmap. The program would think these blocks are free and might overwrite them causing you to lose your data.
  4. Suppose an unreferenced block is not marked free in the free block bitmap. This block wouldn't be able to be reused, causing a storage leak.

How to Solve the Issue of Storage Leaks

DIR *dp = opendir("/usr");	//opendir/readdir let you walk a directory; applied recursively, this is the equivalent of "find", which visits all of the files in the file system

Walking the file system this way will never reach a leaked block: a block that is still marked as used in the bitmap even though no file references it anymore.

Solution 1: There is a data type called ino_t.

stat("foo", &st);	//st.st_ino will give you the inode number which can be used to write directly to the desired block, regardless of whether it has been marked free or not
			//the stat system call returns information about a file based on its file system path
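
Putting the pieces together, here is a small sketch that walks one directory with opendir/readdir and prints each file's inode number via stat ("/usr" is just an example; a real find would also recurse into subdirectories):

    #include <stdio.h>
    #include <dirent.h>
    #include <sys/stat.h>

    int main(void) {
        DIR *dp = opendir("/usr");
        if (dp == NULL)
            return 1;
        struct dirent *e;
        while ((e = readdir(dp)) != NULL) {
            char path[4096];
            struct stat st;
            snprintf(path, sizeof path, "/usr/%s", e->d_name);
            if (stat(path, &st) == 0)
                printf("inode %lu: %s\n", (unsigned long)st.st_ino, path);
        }
        closedir(dp);
        return 0;
    }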

Solution 2: fsck (File System Checker)

fsck looks for data blocks that are marked as used but not referenced by any file, and for inodes with nonzero link counts that aren't found in any directory.

It takes these orphaned files and places them in the lost+found directory, naming them by inode number (for example, /usr/lost+found/01396). Since these files were probably in the middle of being modified when the system crashed, their permissions might be wrong. As a result, fsck only gives read permissions to root.