By Aaron Chung & Nikhil Shridhar
gzip: software application used for compression and decompression of data files
A pretty simple program that has been maintained for decades.
input:  $ gzip foo
output: foo.gz
Now let's take a look at what system calls gzip uses:
// ofd is the descriptor for the output file foo.gz; ifd is the input file foo
if (ofd < 0)
    error();
// ... foo's data is compressed and written to ofd ...
if (close(ofd) != 0)
    error();
if (close(ifd) != 0)
    error();
if (unlink("foo") != 0)
    error();
// bug report says that if the system crashes here, gzip could lose all of your data
exit(0);
If you look at the system calls, gzip is being very careful: it checks every return value and makes sure to close the output file. So how can this program still lose data? This bug is so common that if you inspect random applications running on Linux, there is a good chance they contain the same bug, waiting to lose your data.
The problem is that when the system processes these write requests, it may reorder them for efficiency. Reordering the writes creates the possibility of data loss and storage leaks: for example, if the unlink of foo reaches the disk before foo.gz's data does, and the system crashes in between, both copies of the data are gone. In the following sections we describe why these problems appear.
Solution: call fsync on the output file before closing and unlinking:

if (fsync(ofd) != 0)
    error();
fsync blocks until all of the file's data has actually reached the disk, which restores the ordering the program depends on: foo.gz is durable before foo is removed. While this solves the data loss issue, it slows performance significantly, because the program now waits for the disk.
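Putting it together, the crash-safe shutdown sequence keeps fsync before the close and unlink. A minimal sketch, with error() and the descriptors carried over from the snippet above:

// Force foo.gz's data to disk BEFORE removing foo, so that a crash at
// any point leaves at least one complete copy of the data.
if (fsync(ofd) != 0)
    error();
if (close(ofd) != 0)
    error();
if (close(ifd) != 0)
    error();
if (unlink("foo") != 0) // safe now: foo.gz is already durable
    error();
exit(0);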
A typical Unix file system divides the disk into regions:
Boot sector: contains machine code to be loaded into RAM by the computer's built-in firmware when the machine starts.
Superblock: contains a record of the characteristics of the file system, including its size, the block size, the empty and filled blocks and their respective counts, the size and location of the inode table, the disk block map and usage information, and the size of the block groups.
Free block bitmap: keeps track of which blocks in the data blocks section are free.
Inode table: each entry corresponds to one file in the file system.
Data blocks: store the data in each file.
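For concreteness, here is a minimal sketch of a superblock as a C struct recording the fields listed above; the names and field widths are illustrative, not a real on-disk format:

#include <stdint.h>

struct superblock {
    uint32_t fs_size;          // total size of the file system, in blocks
    uint32_t block_size;       // bytes per block
    uint32_t free_block_count; // number of empty blocks
    uint32_t used_block_count; // number of filled blocks
    uint32_t inode_table_loc;  // starting block of the inode table
    uint32_t inode_table_size; // size of the inode table, in blocks
    uint32_t block_group_size; // blocks per block group
};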
There is overhead involved with reading blocks from disk:
Overhead = seek time + rotational latency + transfer from disk to cache + transfer from cache to RAM = ~18ms
18ms = ~18 million instructions (assuming a CPU that executes about 10^9 instructions per second: 0.018 s × 10^9 ≈ 1.8 × 10^7)
This means that in the same amount of time it takes to fetch one block from disk, you could execute about 18 million instructions' worth of useful work. You would probably prefer to do the 18 million instructions.
Utilization: the fraction of the disk drive's capacity that is doing useful work
Throughput: the rate of request completion
Latency: the overall delay between a request and its response
These are competing objectives: techniques that raise throughput and utilization (such as batching) typically increase latency, and conversely, minimizing the latency of each individual request typically lowers throughput and utilization.
If you have a bunch of individual writes to do, don’t do them one at a time. Instead collect them all together into one big write and then do that big write all at once.
This is done at all levels of file systems in the OS.
A classic example of batching is getchar()/putchar():

int c;
while ((c = getchar()) != EOF) {
    putchar(c);
}

If each call to getchar and putchar issued its own one-byte read or write system call, this loop would be extremely inefficient. Instead, the stdio library batches the work, turning many getchar calls into a single read(0, buf, 4096) and many putchar calls into a single write(1, buf, amount_to_write).
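To see why this helps, here is a rough sketch of how a stdio-style buffered getchar might work; the buffer size matches the 4096 above, but the helper name and details are illustrative assumptions, not glibc's actual internals:

#include <unistd.h>

#define BUF_SIZE 4096

static char buf[BUF_SIZE];
static char *next;          // next unread byte in buf
static ssize_t remaining;   // unread bytes left in buf

int my_getchar(void)
{
    if (remaining <= 0) {
        // Refill: one big read replaces up to 4096 one-byte reads.
        remaining = read(0, buf, BUF_SIZE);
        if (remaining <= 0)
            return -1;      // EOF or error, like stdio's EOF
        next = buf;
    }
    remaining--;
    return (unsigned char)*next++;
}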
The file system will guess what your program will do in the near future and act on this guess. For example, when the system turns getchar() into a 4096-byte read, it is guessing that you will want to keep reading. The system predicts what you will want to do and then actually does some of the work before you explicitly ask for it; this is called prefetching (or speculation).
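A program that knows its own access pattern can make the system's guess explicit. A small sketch using the POSIX call posix_fadvise (the file name here is just an example):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile", O_RDONLY); // "bigfile" is illustrative
    if (fd < 0)
        return 1;
    // Hint that we will read sequentially, so the kernel can prefetch
    // (read ahead) more aggressively than it would by default.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    // ... sequential reads of fd now benefit from readahead ...
    close(fd);
    return 0;
}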
Why does this work?
We can think of our disk access code as a system of layers
The block layer knows nothing about the bitmap and inodes. All it knows is the size of the disk, viewed as an array of blocks, and it communicates with the rest of the file system using only read and write requests.
The block layer maintains a cache. If higher levels of the system exhibit locality of reference, a request to read a given block can often be satisfied from the cache very quickly.
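As a sketch, the block layer's cache can be as simple as a table of recently read blocks; the sizes and the disk_read helper here are assumptions for illustration:

#include <string.h>

#define BLOCK_SIZE 8192
#define CACHE_SLOTS 64

struct cache_slot {
    int  valid;               // has this slot ever been filled?
    long block_no;            // which disk block is cached here
    char data[BLOCK_SIZE];
};

static struct cache_slot cache[CACHE_SLOTS];

extern void disk_read(long block_no, char *buf); // the slow path (~18 ms)

// Read a block, hitting the cache when locality of reference allows.
void read_block(long block_no, char *buf)
{
    struct cache_slot *slot = &cache[block_no % CACHE_SLOTS];
    if (!slot->valid || slot->block_no != block_no) { // miss: go to disk
        disk_read(block_no, slot->data);
        slot->valid = 1;
        slot->block_no = block_no;
    }
    memcpy(buf, slot->data, BLOCK_SIZE);              // hit: memory speed
}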
How can we make file system performance even faster, specifically with regard to the block layer code?
Model the disk as a large array of blocks
At any point in time, the IO head is pointing to some block on disk
To go from the current block h to some other block i, the cost is proportional to the distance |h-i|
Since seek time dominates the overhead time of reading blocks, this model is useful
INTERESTING QUESTION: What is the average cost of seeking to a random block to read next?
Normalize the disk to the unit interval [0, 1] and let the current head position h and the target block i both be uniformly distributed. We can then calculate the average cost of finding the next block as follows:
Average cost = ∫₀¹ ∫₀¹ |h − i| dh di = 2 ∫₀¹ ∫₀ⁱ (i − h) dh di = 2 ∫₀¹ (i²/2) di = 1/3
As it turns out, the resulting value of the above integral is: Average cost = 1/3. That is, a random seek crosses one third of the disk on average.
"Queue" of pending requests:
Requests are stored for a period of time until the block layer decides which to do next. The order doesn't necessarily have to be the order in which the requests appeared.
There are a few different algorithms which are used to choose the next request to execute:
Shortest seek time first (SSTF): if you have a lot of work to do, pick the request that is closest to the current head position. This is a greedy algorithm, so it does not always produce the minimal-cost schedule. In addition, it is unfair in the sense that it can lead to starvation: if new requests keep arriving that are closer to the current position than some older request, the older request will never get executed.
There are a few modifications which will take care of the issue of starvation.
Elevator algorithm: take the shortest seek in the positive direction until you either reach the end of the disk or complete all pending requests ahead of the head, then reverse direction. This algorithm does not cause starvation. However, it is not entirely fair: it doesn't satisfy requests in the order of issuance, and requests near either end of the disk have longer average wait times than requests in the middle.
Circular elevator algorithm: essentially the same as the original elevator algorithm, except that you ALWAYS seek in the positive direction. Once the end is reached, simply return the IO head to the beginning of the disk. While this return trip is added work for the head, it makes the overall algorithm fairer: all requests have the same average wait time, regardless of their position on the disk.
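A minimal sketch of the circular elevator's choice of next request; pending[] holds target block numbers, head is the current head position, and the function name is an assumption:

// Pick the nearest request at or beyond the head; if none remain ahead,
// wrap around to the lowest-numbered pending request.
int pick_next_cscan(const long *pending, int n, long head)
{
    int ahead = -1;   // nearest request at or beyond the head
    int lowest = -1;  // lowest-numbered request overall (wrap target)
    for (int i = 0; i < n; i++) {
        if (pending[i] >= head &&
            (ahead < 0 || pending[i] < pending[ahead]))
            ahead = i;
        if (lowest < 0 || pending[i] < pending[lowest])
            lowest = i;
    }
    return ahead >= 0 ? ahead : lowest;
}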
A score-based variant gives each request a score based on how long it has been waiting and on its distance from the IO head. This is essentially a greedy algorithm without starvation: no matter how far away a request is from the IO head, eventually it will have waited long enough to score first in the wait queue.
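And a sketch of that score-based variant; the equal weighting of waiting time against distance is an illustrative assumption (a real scheduler would tune the weights):

typedef struct {
    long block;   // target block number
    long waited;  // time this request has spent in the queue
} request;

// Distance is penalized and waiting time rewarded, so even a distant
// request eventually outscores everything else and cannot starve.
int pick_next_scored(const request *q, int n, long head)
{
    int best = -1;
    long best_score = 0;
    for (int i = 0; i < n; i++) {
        long dist = q[i].block > head ? q[i].block - head
                                      : head - q[i].block;
        long score = q[i].waited - dist;
        if (best < 0 || score > best_score) {
            best = i;
            best_score = score;
        }
    }
    return best;   // index of the request to service next (-1 if empty)
}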
Consider renaming foo to fuzzy. Look in the working directory's inode to find where the directory's data is on disk; those data blocks contain the directory's entries, including the one for foo. We want to overwrite that entry to read file name "fuzzy" instead of "foo" and file name length 5 instead of 3. This write request will be sitting as a pending request in the wait queue.
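Here is a rough sketch of what such a directory entry might look like on disk, loosely modeled on the classic Unix layout; the field names and widths are illustrative:

#include <stdint.h>

struct disk_dirent {
    uint32_t inode;     // inode number of the entry's file
    uint16_t rec_len;   // bytes from the start of this entry to the next
    uint8_t  name_len;  // 3 for "foo", 5 for "fuzzy"
    char     name[];    // the name itself, not NUL-terminated
};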
Suppose the directory grows because the longer name no longer fits in the existing entry. What would we need to do? We might have to allocate a new data block for the directory (updating the free block bitmap), write the new directory contents into it, and update the directory's inode to point at the new block and record the larger size.
So in what should just be a simple rename operation, instead of just rewriting a block in place, we might have to write several different blocks just to make it happen.
Now let's consider what order we should do those writes in.
DIR *dirp = opendir("/usr"); // this is how a program like find begins enumerating all of the files in the file system
A traversal like this may fail to find a particular data block if there was a storage leak, i.e., the block wasn't marked free after it was done being used, so it belongs to no file that any directory references.
Solution 1: There is a data type called ino_t.

struct stat st;
stat("foo", &st); // the stat system call returns information about a file based on its file system path
// st.st_ino is the file's inode number (an ino_t), which can be used to
// reach the file's blocks directly, regardless of whether they have been
// marked free or not
Solution 2: fsck (File System Checker)
fsck looks for blocks that are marked in use but not referenced by any file, and for inodes with nonzero link counts that aren't found in any directory.
It takes these orphaned files and places them in /usr/lost+found (e.g., /usr/lost+found/01396, named after the file's inode number). Since these files were probably in the middle of being modified when the system crashed, their permissions might be wrong; as a result, fsck gives read permission only to root.