Simple Performance Model for Disk I/O
Model the disk as one long line of locations. The cost of a read is dominated by seek time, and the cost of a seek is the absolute value of the difference between the two locations: |location a - location b|.
Finding the Average Read Cost (assumes random reads)
Consider a continuous approximation where locations are from 0 to 1.
The average cost at position 0 is 0.5, the average cost at position 1 is 0.5.
At the halfway point, the average cost is 0.25. Repeating this at position 0.25 and so on traces out a parabolic curve of average cost as a function of position. The overall average cost is the area under this parabolic curve:
\[ \int_0^1 \left[\int_0^h (h-x)\,dx + \int_h^1 (x-h)\,dx\right] dh \;=\; \int_0^1 \left[\frac{h^2}{2} + \frac{(1-h)^2}{2}\right] dh \;=\; \frac{1}{3} \]
The resultant average is 1/3. That is, the average cost of a read is 1/3 the length of the disk.
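Not from the lecture, but a quick numerical sanity check of the 1/3 result: average |a - b| over many random pairs of positions in [0, 1).

```c
/* Monte Carlo check that the average seek distance between two uniformly
   random positions in [0, 1) is about 1/3. Purely illustrative. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long trials = 10000000;
    double total = 0.0;
    srand(42);
    for (long i = 0; i < trials; i++) {
        double a = (double)rand() / RAND_MAX;   /* current head position */
        double b = (double)rand() / RAND_MAX;   /* requested position    */
        total += (a > b) ? a - b : b - a;       /* seek cost = |a - b|   */
    }
    printf("average seek cost ~= %f (expected 1/3)\n", total / trials);
    return 0;
}
```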
Example: set of read requests {20, 97, 512, 109321}, with the head currently at location 96.
Algorithms for Seek Time
SSTF -- “Shortest Seek Time First”
Best throughput (waste least time seeking)
Potential for starvation: if new requests keep arriving near the head, a far-away request may never be served.
FCFS -- “First Come First Serve”
Worse Throughput
No Starvation
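To make the two policies concrete, here is a small sketch (not from the lecture; seek_cost and the fixed-size arrays are illustrative) that totals the head movement for the example request set {20, 97, 512, 109321} with the head at 96, under FCFS order and under SSTF.

```c
/* Total head movement for FCFS vs SSTF on the example request batch. */
#include <stdio.h>
#include <stdlib.h>

#define N 4

static long seek_cost(long a, long b) { return labs(a - b); }

int main(void) {
    long requests[N] = {20, 97, 512, 109321};   /* arrival order */
    long head, total;

    /* FCFS: serve strictly in arrival order. */
    head = 96; total = 0;
    for (int i = 0; i < N; i++) {
        total += seek_cost(head, requests[i]);
        head = requests[i];
    }
    printf("FCFS total seek distance: %ld\n", total);

    /* SSTF: repeatedly serve the pending request closest to the head. */
    int done[N] = {0};
    head = 96; total = 0;
    for (int served = 0; served < N; served++) {
        int best = -1;
        for (int i = 0; i < N; i++)
            if (!done[i] && (best < 0 ||
                seek_cost(head, requests[i]) < seek_cost(head, requests[best])))
                best = i;
        total += seek_cost(head, requests[best]);
        head = requests[best];
        done[best] = 1;
    }
    printf("SSTF total seek distance: %ld\n", total);
    return 0;
}
```

For this small static batch the two totals come out almost identical; SSTF's throughput advantage shows up when requests keep streaming in and the head keeps picking nearby ones, which is also exactly how a far-away request can starve.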
Hybrid Approach
Both SSTF and FCFS are priority scheduling: SSTF prioritizes by seek time, FCFS by arrival time. A hybrid approach sets the priority ("niceness") to the sum of seek distance and arrival time, serving the smallest value first.
The hybrid is fairer than SSTF (no starvation) and has higher throughput than FCFS (nearby requests are sometimes served earlier), as in the sketch below.
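A minimal sketch of that hybrid priority, assuming a request record with a block number and an arrival time (names are illustrative); the scheduler serves the pending request with the smallest niceness. In practice the seek and wait terms would need to be scaled into comparable units.

```c
/* Hybrid ("niceness") priority: the sum of seek distance and arrival time;
   the smallest value is served first. Illustrative sketch only. */
#include <stdlib.h>

struct request {
    long block;    /* requested disk location  */
    long arrival;  /* time the request arrived */
};

/* Smaller niceness = served sooner. Close requests and early arrivals win,
   and because new arrivals carry ever-larger arrival times, an old request
   eventually beats any newcomer, so nothing starves. */
long niceness(const struct request *r, long head) {
    return labs(r->block - head) + r->arrival;
}
```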
Elevator Algorithm
Another algorithm that pays no attention to arrival time but still avoids starvation.
The elevator algorithm is shortest seek time first, restricted to one direction at a time.
It zigzags along the disk; no starvation, but wait times are uneven, like the parabola from before: requests near the ends of the disk wait longer than requests in the middle.
Circular Elevator
Instead of zigzagging, it always moves in one direction; when it reaches the end, it speeds back to the bottom and sweeps in the same direction again.
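A sketch of the circular elevator's service order, assuming the pending requests are known up front; cscan_order and the fixed-size arrays are illustrative.

```c
/* Circular elevator (C-SCAN) ordering: serve everything at or above the
   head going up, then wrap to the bottom and continue upward. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_long(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Fills 'order' with the positions in the sequence they will be served. */
void cscan_order(long *pending, int n, long head, long *order) {
    qsort(pending, n, sizeof pending[0], cmp_long);
    int k = 0;
    for (int i = 0; i < n; i++)          /* upward sweep: at or above head */
        if (pending[i] >= head) order[k++] = pending[i];
    for (int i = 0; i < n; i++)          /* wrap around: below the head    */
        if (pending[i] < head) order[k++] = pending[i];
}

int main(void) {
    long pending[] = {20, 97, 512, 109321};
    long order[4];
    cscan_order(pending, 4, 96, order);
    for (int i = 0; i < 4; i++) printf("%ld ", order[i]);
    printf("\n");                        /* prints: 97 512 109321 20 */
    return 0;
}
```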
Two programs, A and B, issue requests. A does sequential reads from the beginning of the disk; B does sequential reads from somewhere else (A begins at 1, B at 1001).
Serve A's requests for a while, putting B off; eventually go to B and serve its requests.
Anticipatory Scheduling
Even though the head is at location 1 and the next pending request is at 1001, the scheduler purposely delays in the hope that process A will request location 2 soon.
This relies on guessing program behavior. Programs can therefore use special system calls that do nothing except hint that they will probably issue more requests in a certain pattern; the scheduler bets on those patterns.
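A rough sketch of the anticipatory decision itself; the threshold, stall time, and should_anticipate are all assumptions (real implementations track per-process access patterns and timeouts).

```c
/* If the process whose request just completed has been reading sequentially
   and the best pending request is far away, stall briefly and hope a nearby
   request from that process arrives instead. Illustrative only. */
#include <stdbool.h>
#include <stdlib.h>

#define FAR_THRESHOLD 1000L    /* "far away" in blocks (made up)               */
#define ANTICIPATE_MS 3        /* how long to stall before giving up (made up) */

bool should_anticipate(long head, long best_pending, bool last_proc_sequential) {
    long distance = labs(best_pending - head);
    return last_proc_sequential && distance > FAR_THRESHOLD;
}
```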
Multiprogramming
Multiple programs each have their own I/O needs, so requests arrive in parallel.
The requests can potentially be batched and served in an efficient serial order.
A file system is a hybrid data structure that lives partly on flash/disk (secondary storage) and partly in RAM: you want the capacity and permanence of secondary storage and the performance of RAM.
Professor Eggert's 1974 undergraduate research project was to duplicate UNIX on a Lockheed machine. To write a text editor, he needed to write a file system, so he created a very simple one. First, he divided the disk into an array of sectors. Every file begins at the start of a sector, and all parts of a file are contiguous. If a file's size isn't a multiple of 512 bytes, allocate an extra sector and waste the leftover bytes. Keeping track of files: the first few sectors are filled with a table that says where the files are (a directory), say 2048 bytes. It is an array of entries with three fields: file name, start sector, and file size in bytes. The start sector and file size each use 16 bits (4 bytes total), and the name is limited to 12 bytes, padded with nulls to mark the end of the name.
Similar file systems were developed, e.g. RT-11 around 1972 (RT = Real Time).
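A sketch of what such a directory could look like in C, following the field sizes above (12-byte null-padded name, 16-bit start sector, 16-bit size in bytes); the struct and the lookup helper are illustrations, not the actual 1974 or RT-11 format.

```c
#include <stdint.h>
#include <string.h>

struct dir_entry {
    char     name[12];      /* null-padded file name                    */
    uint16_t start_sector;  /* file data starts here and is contiguous  */
    uint16_t size_bytes;    /* file size in bytes                       */
};                          /* 16 bytes/entry: 128 fit in 2048 bytes    */

/* Linear search of the directory for a file; returns its index or -1. */
int lookup(const struct dir_entry *dir, int nentries, const char *name) {
    for (int i = 0; i < nentries; i++)
        if (strncmp(dir[i].name, name, sizeof dir[i].name) == 0)
            return i;
    return -1;
}
```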
Advantages:
Simple
Predictable
Good performance for sequential IO to one file
Downsides:
Waste up to 511 bytes per file (internal fragmentation), plus external fragmentation.
Because files must be contiguous, a file cannot easily be moved.
External fragmentation: there is enough free space for a file, but it is not contiguous, so it cannot be used.
Another downside: the directory imposes a limit on the number of files.
Truncating a file is very simple (just change the size in the directory).
Growing a file is difficult (only possible if there is enough free space right after it). Solution: ask the application for a maximum output size up front so the system can reserve space; this requires applications to know their maximum size in advance.
File names are also pretty short (longer names would waste directory space).
No user-created directories (no nested directories).
Other "competition" tried to avoid certain problems, such as: external fragmentation, the directory's limit on the number of files, the lack of user-created directories, having to declare output size up front, and difficulty growing files. This leads to:
FAT File System (late 1970's) [File Allocation Table = FAT]
Source:
http://www.cs.ucla.edu/classes/spring12/cs111/scribe/12a/
The disk is modeled as an array of sectors; each block is 4 KiB, or 8 sectors.
A file is treated as a linked list of blocks, and the physically next block on disk is not necessarily part of the same file.
The next field lives in the FAT, a table of next fields: an array of 16-bit block numbers, one per block.
A file can now be represented by a single 16-bit number, the beginning of its linked list.
File block locations are random-ish, so sequential access is slow, but this removes the problem of external fragmentation.
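A sketch of following a FAT chain; the 0xFFFF end-of-chain marker and the walk_file function are assumptions. It also shows why reaching block N of a file means following N next pointers.

```c
#include <stdint.h>
#include <stdio.h>

#define END_OF_CHAIN 0xFFFF   /* assumed end-of-file marker in the FAT */

/* Visit every block of a file in order. Consecutive blocks of a file can be
   anywhere on disk, so sequential reads seek a lot, and getting to the Nth
   block of a file is O(N) table lookups. */
void walk_file(const uint16_t *fat, uint16_t first_block) {
    for (uint16_t b = first_block; b != END_OF_CHAIN; b = fat[b])
        printf("read block %u\n", b);
}
```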
In FAT, a directory is a file whose contents are directory entries.
File names have two parts, 8 bytes plus 3 bytes (e.g. foo.c is stored as "foo" padded to 8 bytes and "c" padded to 3).
Each entry also holds two bytes giving the block number of the file's first block, a byte for the type of file (0 for regular file, 1 for directory), and a field for the file size in bytes.
(Files still occupy whole blocks, so up to block size minus 1 bytes per file can be wasted.) An entry of all 0's means unused.
At the very beginning of the file system is a small area, the "superblock", which contains file system information: version, size, and the location of the root directory.
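A sketch of the on-disk records just described; the name, type, and first-block widths follow the notes, while the size field's width and the exact byte packing are assumptions for illustration.

```c
#include <stdint.h>

struct fat_dirent {
    char     name[8];      /* base name, zero-padded: "foo"             */
    char     ext[3];       /* extension, zero-padded: "c"               */
    uint8_t  type;         /* 0 = regular file, 1 = directory           */
    uint16_t first_block;  /* head of the file's linked list in the FAT */
    uint32_t size;         /* file size in bytes (width assumed)        */
};                         /* an all-zero entry means "unused"          */

struct superblock {
    uint32_t version;      /* file system version               */
    uint32_t size;         /* size of the file system           */
    uint16_t root_dir;     /* first block of the root directory */
};
```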
Downsides:
-Still imposes a size limit (although not really a problem at the time): block numbers are 16 bits and blocks are 2^12 bytes, so the biggest file, and the file system itself, is at most 2^16 × 2^12 = 2^28 bytes, about 256 megabytes. Current FAT is FAT32 (all the 16-bit fields become 32 bits).
-Hard to keep track of free space: a free block is marked with a special value (e.g. -1) in its FAT entry, so finding free space means scanning the table.
Fixed most of the RT-11 problems, but new problems:
-Sequential I/O can be expensive; performance is slower because access is seeky. The general fix is defragmentation: move the fragments of each file together so they are contiguous, making sequential I/O fast again. But defragmentation is very expensive.
-Possible to corrupt the file system (e.g. losing power during defragmentation). The general fix is a file system repair procedure.
-Potential problems when renaming files. Renaming foo.c to bar.c is easy: if power goes out, the worst case is that the name change isn't saved (not so bad). But renaming a/foo.c to b/foo.c is much worse: if directory a is written out first and power fails, the file is lost completely; if directory b is written out first and power fails, two directory entries point to the same file, and later removing it through one leaves the other pointing at a bad location. As a result, moving files between directories is not allowed.
Unix File System (~1977)
Source:
https://sites.google.com/site/drewcsci325spring2013/projects/01---file-system/unix-version-6
Similar to FAT, but it attempts to avoid the defragmentation problem and the rename-across-directories problem.
The fix was the inode: a small, fixed-size record of metadata about one file. Inodes are stored on disk and cached in RAM while in use.
New file system layout: an inode table at the beginning, followed by data blocks; each inode describes one file.
The inode does not include the file name. Instead, a directory entry has two fields: 14 bytes for the name and 2 bytes for the inode number (the inode holds everything else about the file).
It is possible for two names to have the same inode number (linking). The inode keeps a link count: when the count drops to 0 the file's space is considered free; otherwise someone still wants it.
Linking is what makes moving a file between directories safe: add a directory entry in the new directory (+1 link), then remove the old entry (-1 link).
One problem is that the link count can overflow (solution: disallow too many links).
What an inode contains: size in bytes, type, date, permissions, number of links, and a small fixed number of block pointers that say which data blocks hold the file's contents.
The next pointer is an indirect block: a block full of block numbers, covering a much larger amount of data (megabytes).
The last pointer is a doubly indirect block: it points to a block of pointers to indirect blocks.
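A sketch of an inode and of mapping a logical block index to a disk block. The pointer counts and the read_block helper are assumptions; the shape (direct pointers, one indirect block, one doubly indirect block) is what the notes describe.

```c
#include <stdint.h>

#define NDIRECT   10        /* number of direct pointers (assumed)  */
#define PER_BLOCK 1024      /* 4 KiB block / 4-byte block numbers   */

struct inode {
    uint32_t size;                 /* size in bytes                   */
    uint16_t type;                 /* file, directory, ...            */
    uint16_t nlink;                /* number of links                 */
    uint32_t mtime;                /* date                            */
    uint16_t mode;                 /* permissions                     */
    uint32_t direct[NDIRECT];      /* first data blocks of the file   */
    uint32_t indirect;             /* block full of block numbers     */
    uint32_t double_indirect;      /* block of indirect block numbers */
};

/* Hypothetical helper: read disk block 'bno' into an array of block numbers. */
extern void read_block(uint32_t bno, uint32_t out[PER_BLOCK]);

/* Return the disk block holding logical block 'n' of the file. The depth of
   indirection is bounded, so there is no chain to walk: this is why lseek
   can be O(1). */
uint32_t bmap(const struct inode *ip, uint32_t n) {
    uint32_t buf[PER_BLOCK];
    if (n < NDIRECT)
        return ip->direct[n];
    n -= NDIRECT;
    if (n < PER_BLOCK) {                       /* single indirection */
        read_block(ip->indirect, buf);
        return buf[n];
    }
    n -= PER_BLOCK;                            /* double indirection */
    read_block(ip->double_indirect, buf);
    read_block(buf[n / PER_BLOCK], buf);
    return buf[n % PER_BLOCK];
}
```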
Example: git clone doesn't really copy all the files; it just points to them with links, and linking is fast.
The link system call keeps the old name and adds a new one: link("a", "b"). Originally there was no rename system call; instead, programs used link followed by unlink.
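A minimal sketch of rename-by-link using the real POSIX link() and unlink() calls; rename_via_link is an illustrative name and the error handling is deliberately simple.

```c
#include <stdio.h>
#include <unistd.h>

int rename_via_link(const char *oldpath, const char *newpath) {
    if (link(oldpath, newpath) != 0) {   /* add the new name (+1 link)    */
        perror("link");
        return -1;
    }
    if (unlink(oldpath) != 0) {          /* remove the old name (-1 link) */
        perror("unlink");
        return -1;
    }
    return 0;
}
```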
Potential problem: circular links, where a directory links to itself or even to a parent. Cycles are therefore prohibited in Unix: you cannot make hard links to directories at all, only to files.
Suppose you do: (rm foo; cat) < foo. foo begins with a link count of 1. At rm foo, the link count drops to 0, so its storage would appear free even though cat still has the file open. Fix: only reclaim a file's storage when its link count is 0 and no one has the file open.
This is an OS property, not a file system property.
The Unix file system still hasn't fixed the defragmentation problem, but seeking within a file is fast: lseek(fd, N, 0) would cost O(N) on FAT (follow N next pointers) but is O(1) in Unix (index through the inode's block pointers).
Berkeley Fast File System variant: at the start of the file system is a bitmap of free blocks, one bit per data block, so the bitmap is tiny (about a 1/65,000 ratio to the data). It can fit in RAM, which makes it easy to find free blocks and helps with contiguous allocation. This partially solves the defragmentation problem.
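A sketch of bitmap helpers, assuming the convention that a set bit means the block is free; the names and the contiguous-run search are illustrative.

```c
#include <stdint.h>

/* Is block 'b' free? (convention here: bit set = free) */
static int is_free(const uint8_t *bitmap, uint32_t b) {
    return (bitmap[b / 8] >> (b % 8)) & 1;
}

/* Mark block 'b' as allocated by clearing its bit. */
static void mark_used(uint8_t *bitmap, uint32_t b) {
    bitmap[b / 8] &= ~(1u << (b % 8));
}

/* Find a run of 'want' contiguous free blocks starting at or after 'hint';
   contiguous runs are what keep later sequential I/O fast. Returns the first
   block of the run, or (uint32_t)-1 if none is found. */
static uint32_t find_contiguous(const uint8_t *bitmap, uint32_t nblocks,
                                uint32_t hint, uint32_t want) {
    for (uint32_t start = hint; start + want <= nblocks; start++) {
        uint32_t run = 0;
        while (run < want && is_free(bitmap, start + run))
            run++;
        if (run == want)
            return start;
    }
    return (uint32_t)-1;
}
```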