CS111 Lecture 11: File System Design

May 9, 2013

By Justin Morgan

1. Disk Scheduling Algorithms

When applications issue a high volume of I/O requests for a single readable or writable device, the operating system must decide how to queue these requests in order to minimize latency and promote fairness.

A scheduling algorithm orders incoming I/O requests in a wait queue according to properties of these tasks, such as their arrival time, the locations of their target sectors on the disk, and the latency incurred by moving the disk head to those sectors.

Schedule Algorithm

To illustrate scenarios where a scheduling algorithm might be useful, we will assume a simple disk model: an array of N blocks labeled 0 to N - 1, where one block corresponds to one sector. At any point in time the disk head h points to one of these blocks, and when the wait queue is non-empty it must move to the next scheduled job at block i. The cost of accessing that block is |h - i|. For simplicity's sake we normalize distances so that the maximum possible seek, a sweep across the entire disk, has cost 1. As more requests come in, the next queued index i can appear at an arbitrary location in the block array, and there are a variety of ways in which task i can be prioritized in the queue.

First-Come, First-Serve (FCFS)

In first-come, first-serve scheduling, the operating system prioritizes incoming I/O requests based on their arrival time. The next chosen block position i to read or write corresponds to the request that arrived earliest among the tasks in the queue.

FCFS Scheduling Example
An example of FCFS scheduling. Tasks A, B, C, D, and E are present in the queue at t = 0 in order of arrival time.
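As a minimal sketch of this policy in C (the request struct and field names here are illustrative, not from the lecture), the scheduler simply serves the oldest pending request:

struct request { double arrival; int block; };  /* illustrative request record */

/* Return the index of the pending request that arrived earliest, or -1 if the queue is empty. */
int fcfs_pick(const struct request *queue, int n) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (best < 0 || queue[i].arrival < queue[best].arrival)
            best = i;
    return best;
}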

We would like to measure the average disk access cost of an incoming job i given this approach. Consider an instance where h is located at one end of the disk:

FCFS Problem A

In the worst case, h will have to move N blocks to the opposite side to reach i, resulting in a total cost |h - i| = 1. In the best case, i will occur at the exact position of the head, where |h - i| = 0. On average, the cost of moving the head to the target block is approximately |h - i| = 0.5.

Now we place the head directly in the center of the array, at h = N/2:

FCFS Problem B

Since the maximum distance the head will move to the next job is N/2 blocks, the average seek cost is around |h - i| = 0.25. For an arbitrary head position, the overall average cost can be found by integrating |h - i| over both i and h:

Average Seek Cost
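Written out, with the head position h and target position i both normalized to the interval [0, 1], the double integral evaluates as follows:

∫₀¹ ∫₀¹ |h - i| di dh = ∫₀¹ [ h²/2 + (1 - h)²/2 ] dh = 1/6 + 1/6 = 1/3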

In terms of seek time, we would like to do better than an average cost of 1/3.

Shortest Seek Time First (SSTF)

Under the shortest seek time first scheduling strategy, the operating system prioritizes the queued I/O tasks that are closest to the current head position. This method improves throughput and average latency by allowing jobs that are close together, such as contiguous block reads, to complete before requests that require access to a block farther away on the disk.

Shortest Seek Time First
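A minimal sketch of the selection rule in C, assuming the pending block numbers are kept in a simple array (the array and function names are illustrative):

#include <stdlib.h>

/* Return the index of the pending block closest to the current head position, or -1 if none. */
int sstf_pick(const int *pending, int n, int head) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (best < 0 || abs(pending[i] - head) < abs(pending[best] - head))
            best = i;
    return best;
}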

However, this introduces the problem of starvation, where requests associated with blocks far away from the disk head are forced to wait while jobs with shorter seek times are served first, as in the following example:

SSTF Issue
Tasks with short seek times (S) are submitted at high frequency near h. As more of these tasks arrive, the task at the far end of the disk (L) is continually pushed to the back of the wait queue, causing it to starve.

Elevator Algorithm

The elevator algorithm for disk scheduling follows the same principle as SSTF, but only serves requests that lie ahead of the head in its current direction of movement. This is analogous to an elevator, which picks up passengers while moving between upper and lower floors in one continuous direction before reversing. This approach eliminates the starvation issue of SSTF, since every request is guaranteed to be completed regardless of the arrival order of other tasks.

Elevator Algorithm

However, the elevator algorithm is not as fair as FCFS since tasks that come in behind the disk head must wait for the head to switch directions. Consider the following two scenarios:

Elevator Algorithm Issue A Elevator Algorithm Issue B

The task labeled i in the first diagram will wait longer than task i in the second diagram due to the first task's location just behind the head at the far end of the disk. This mirrors the way elevator patrons on the bottom and top floors of a building must wait longer for the elevator to return to their floor than patrons who request the elevator at floors in between.

There are two solutions that can alleviate the fairness cost. The first is to place the most frequently accessed blocks in the center of the disk. The second, and generally preferred, approach is a circular elevator that proceeds in only one direction, such as from left to right, and jumps back to the left edge once the right boundary has been reached. The circular elevator equalizes the expected seek cost for every possible location of i, at the price of rewinding the disk head after each complete pass over the disk.

Circular Elevator
An example execution of the circular elevator algorithm. The faded line in the timing diagram represents the overhead of repositioning the disk head to block 0.
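A sketch of the circular variant in C, under the same illustrative pending-array assumption as before: the scheduler takes the nearest block at or ahead of the head, and wraps to the lowest-numbered block when nothing lies ahead.

/* Circular elevator: sweep in one direction only, wrapping at the disk's edge. */
int circular_elevator_pick(const int *pending, int n, int head) {
    int ahead = -1, lowest = -1;
    for (int i = 0; i < n; i++) {
        if (pending[i] >= head && (ahead < 0 || pending[i] < pending[ahead]))
            ahead = i;                       /* nearest request in the sweep direction */
        if (lowest < 0 || pending[i] < pending[lowest])
            lowest = i;                      /* request served first after the wrap-around */
    }
    return ahead >= 0 ? ahead : lowest;      /* -1 if the queue is empty */
}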

Anticipatory Scheduling

We can apply the principle of dallying to disk scheduling by delaying the response to a disk request in anticipation that closer read or write jobs will arrive shortly. The operating system can wait a bounded amount of time, such as 2 ms, for an I/O request closer to the head position to come in. Consider the example below, where successive read jobs are issued to the system one at a time:

Anticipatory Scheduling

Since only one request exists in the queue at a time, the disk head will move to block 10, then block 500, then 11, 501, 12, and so on. With anticipatory scheduling, if the requests arrive within a suitable window, the scheduler will pause after servicing block 10 in anticipation that the contiguous blocks 11, 12, 13, and 14 will be requested shortly, reducing the overhead of migrating the disk head across large distances.
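A sketch of the idea in C; the helper functions, the 2 ms delay, and the closeness threshold are all assumptions made for illustration, not part of any real kernel interface.

#include <stdlib.h>

#define NEAR_BLOCKS 16   /* assumed threshold for "close to the head" */
#define DALLY_MS     2   /* assumed anticipation window */

extern int  closest_pending(int head);   /* hypothetical: nearest queued block, or -1 */
extern void wait_ms(int ms);             /* hypothetical: sleep for a few milliseconds */

int anticipatory_pick(int head) {
    int b = closest_pending(head);
    if (b < 0 || abs(b - head) > NEAR_BLOCKS) {
        wait_ms(DALLY_MS);               /* dally briefly: block 11 may arrive right after block 10 */
        int later = closest_pending(head);
        if (later >= 0 && (b < 0 || abs(later - head) < abs(b - head)))
            b = later;                   /* a closer request showed up; serve it instead */
    }
    return b;
}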

2. Flash Storage

Flash memory operates substantially faster than a hard disk since it contains no moving parts such as a read-and-write head, and therefore does not suffer from seek latency. However, flash memory tends to be more expensive, and flash storage blocks are susceptible to wearing out when many writes are performed to the same location. Due to this risk, most flash controllers implement wear leveling, where the controller moves frequently modified blocks around in memory so that no single location wears out prematurely. As a result, the controller must keep track of these locations somewhere in flash memory, as metadata that must itself migrate often to reduce wear. These operations degrade the performance of flash memory under write-heavy workloads.

Moreover, writing to flash memory requires that a block be erased before it can be rewritten. Erasure operates on large blocks at a time, which places a premium on sequential writing. This undermines the idea that writes can be performed via random access (as reads can), since an entire block must be erased and rewritten each time.

3. File Systems

A file system is a data structure existing on disk (or in SSD) that provides an abstraction of how the operating system defines and manages a set of files. To illustrate general concepts and issues associated with how file systems can be organized, we will explore three major examples of file system design.

Paul Eggert File System - (c. 1974)

The “Paul Eggert File System” is a design created for a minicomputer, based on a simple intuition of how files should be stored on disk. The system can be visualized as a single straight line of contiguous storage. Each file begins on one of the disk's 512-byte sector boundaries and occupies a contiguous run of sectors; since file data rarely fills its final sector exactly, unused space is left between the end of one file's data and the next sector boundary.

Paul Eggert File System

The first 16 sectors of the disk are reserved to store a fixed-size directory, consisting of an array of directory entries. Each 20-byte entry consists of three fields: the first 16 null-filled bytes hold the name of the file, the next 16 bits give the starting sector index of the file, and the last 16 bits give the length of the file in bytes. An entry is marked as unused if its name field consists of all null bytes. Since each entry has a fixed size, any file can be found by simply scanning the directory table for an entry whose name field matches.
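The entry layout can be sketched as a C struct (the struct and field names are my own, chosen to match the description above):

#include <stdint.h>

struct dir_entry {           /* 20 bytes per entry */
    char     name[16];       /* null-filled file name; all zero bytes means the entry is unused */
    uint16_t start_sector;   /* sector index where the file's data begins */
    uint16_t length;         /* file length in bytes */
};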

The greatest advantage of this design is simplicity. In addition, since data is stored sequentially on disk, contiguous reads are relatively fast.

However, several disadvantages are inherent in this design:

  • The format of the directory table disallows the definition of subdirectories.
  • Since the table is of fixed size, there is an imposed limit on the number of files the system can store at once.
  • The size of each file must be known in advance in order to place it on disk. At the least, you will need an approximation of the file's space requirement, as well as an estimate of how likely the file is to grow beyond its allocated location.
  • The directory can become corrupted. For instance, directory entries can describe files that overlap other entries, and this can go unchecked.
  • Fragmentation can easily occur given the organization of files on the disk. This arises in two forms:
    • Internal fragmentation results from the restricted sizes of allocated sectors. For instance, a 532-byte file will occupy two sectors, wasting 1024 - 532 = 492 bytes of free space at the end of its second sector. On average, the file system wastes 256 bytes (half a sector) per file.
      Internal Fragmentation
    • External fragmentation happens when enough free space exists on the disk, but that free space is not contiguous and therefore cannot be used as a single region. As a result, allocation requests for large files may be rejected even though enough total space exists in the file system.
      External Fragmentation

FAT - (Late 1970s)

External fragmentation is one of the most damaging problems for file system performance. The FAT (File Allocation Table) file system was designed specifically to eliminate this type of fragmentation.

FAT
  • Data is broken into 4 KiB blocks, or 2^12 bytes each. With 512-byte sectors, each block spans 2^12 / 2^9 = 8 sectors on disk.
  • The first sector is reserved for the boot sector, in case this partition is bootable from the device.
  • The superblock contains metadata about the file system such as the FAT version, block size, file system size, FAT length, location of the root directory, etc.
  • The FAT is represented by an array of integers, with one entry for each block in data. Each entry can hold one of three values:
    0 = free block
    > 0 = index of the next block of the current file
    -1 = EOF, marking the last block of the current file

There exists one FAT entry for each block in data. In the FAT 16 architecture, block indices are 16 bits wide. With 4 KiB (2^12-byte) blocks, the file system can reach a maximum size of 2^12 x 2^16 = 2^28 bytes = 0.25 GiB. In FAT 32, the maximum size can reach 2^12 x 2^32 = 2^44 bytes = 16 TiB.

FAT Directory

Directories are stored in data as variable-size tables to avoid imposing a limit on how many files can be stored. A directory is defined by an array of directory entries, each consisting of the following fields according to the FAT 16 specification (sketched as a C struct after the list):

  • 11-byte null-filled filename, split into an 8-byte name field and a 3-byte extension. This convention spawned the familiar 8.3 filename syntax present in old versions of DOS and Microsoft Windows.
  • 32-bit field for the file size, in bytes
  • 16-bit field for the index of the first block
  • Additional bit fields for extra data
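Sketched as a C struct, following the field list above (the struct name, field names, and ordering are illustrative rather than the exact on-disk layout):

#include <stdint.h>

struct fat16_dir_entry {
    char     name[11];       /* 8-byte name plus 3-byte extension ("8.3") */
    uint32_t size;           /* file size in bytes */
    uint16_t first_block;    /* index of the file's first block in the FAT */
    /* ... additional fields for attributes, timestamps, etc. ... */
};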

Files are read by jumping between block entries in the FAT, starting with the block index defined in the third field of the file’s directory entry. For instance, in the below example, file A is stored in block entries 30, 99, and 54. The file ends at block index 54, which is indicated by an EOF.

FAT Example
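A sketch in C of following such a chain, treating the FAT as an in-memory array of signed entries using the conventions listed earlier (0 = free, positive = next block, -1 = EOF); the function name is illustrative:

#include <stdio.h>

#define FAT_EOF (-1)

/* Print every block belonging to one file, starting from the block index stored in
   its directory entry; for file A above this visits 30, then 99, then 54. */
void print_file_blocks(const int *fat, int first_block) {
    for (int b = first_block; b != FAT_EOF; b = fat[b])
        printf("block %d\n", b);
}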

We might be concerned that the FAT's jumping mechanism has an adverse effect on performance. To retrieve data from a file, the disk head would have to move from the FAT to a data block, then back to the FAT, then to the next data block, and so on. Ideally, we would like to cache the FAT in RAM so that the disk head only has to move from data block to data block to perform a read. In FAT 16, each FAT entry is two bytes per block. With 4,096-byte blocks, the FAT represents roughly 2 / 4096 = 1 / 2048 = 1 / 2^11 of the file system. The largest possible FAT 16 system is 2^28 bytes, so we would only need to fit 2^28 / 2^11 = 2^17 bytes = 0.125 MiB into cache, which is entirely feasible.

Since the FAT file system divides disk space into blocks, it is allowed to place portions of files across noncontiguous regions depending on how much free space exists between blocks in the data field. This effectively eradicates the problem of external fragmentation. In turn, however, FAT sacrifices the performance benefit of contiguous reads by splitting files across the disk.

Two core issues exist with the FAT file system:

  • Performance suffers when files themselves are fragmented across the disk. Developers of FAT attempted to address this issue with programs called defragmenters, which locate scattered blocks and attempt to reorder them contiguously (FAT[i] == i + 1), as if allocated by RT-11. Defragmenters tie up the disk while they run, since file reads and writes must halt while the defragmenter reorders blocks.
  • Moving files between directories can risk placing the file system in a corrupt state. To move a file, two directory entries must be modified one at a time: either the old entry is erased and then the new entry added, or the new entry is added first and the old one erased afterward. If power is cut between the two operations, the file system will either lack any directory entry pointing to the file's data (first ordering) or contain two directory entries pointing to the same file (second ordering). In the second case, if the file is later removed through one of the entries, the FAT will mark its data blocks as free to be claimed by other files; the leftover directory entry will then point to junk blocks and cause disaster if it is ever accessed by the user.
    Power Cut

Berkeley FFS - (1980s)

The Berkeley FFS (Fast File System) tackles the two issues described above with the FAT design. Several structures are borrowed from the previous example, with the addition of a free-space bitmap that marks which blocks are free or in use (one bit per block), and a region to store inode tables:

Berkeley FFS

Originally used in many Unix systems, an inode (index node) is a data structure on disk that describes a file independently of any directory listing. The file system keeps fixed-size inodes in a fixed region, each structured as follows:

Inode

To prevent these structures from growing indefinitely, an inode contains ten direct pointers to data blocks on the disk, covering blocks 0 through 9 of the file. Blocks are 8 KiB = 8192 bytes each, and pointers in inodes are 32 bits wide. These 10 pointers alone thus allow storage of 10 x 8192 bytes of data.

An eleventh pointer contains the location of an indirect block in data with 8192 / 4 = 2048 additional pointers to data blocks, allowing 2048 x 8192 = 2^24 bytes of additional data storage.

A twelfth pointer in the inode points to a doubly indirect block containing pointers to up to 2048 more indirect blocks, adding up to 2048^2 x 8192 = 2^35 bytes of data. This allows a very large single file at the cost of some overhead, since the indirect blocks themselves can occupy 8192 x (2048 + 1) ≈ 2^24 bytes per file. Even so, only about 2^24 / 2^35 ≈ 1/2000 of the disk is spent on indirect blocks, less than a percent of lost disk utilization.
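The inode layout described above can be sketched as a C struct (the field names are my own; the real BSD definitions differ in detail):

#include <stdint.h>

#define NDIRECT 10

struct ffs_inode {
    uint32_t size;                /* file size in bytes */
    uint16_t nlink;               /* link count: number of directory entries referencing this inode */
    uint32_t direct[NDIRECT];     /* direct pointers to the file's first ten 8 KiB data blocks */
    uint32_t single_indirect;     /* block holding 2048 more pointers to data blocks */
    uint32_t double_indirect;     /* block holding 2048 pointers to indirect blocks */
    /* ... owner, permissions, timestamps, etc. ... */
};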

FFS Directory

Directory entries are simplified in the FFS model. Contained in a directory block, each directory entry lists only the name of the file and a pointer to its respective inode, which holds the remaining information about the file, such as pointers to its data blocks on disk. This design choice tackles the second problem with FAT by adding a level of indirection: to move a file between directories, only the old and new directory entries are modified, while the inode and its data stay in place.

With the inode approach, holes can exist between data block pointers, representing space that has not been allocated for the file. Reading from a hole returns zeros; writing into a hole causes blocks to be allocated.

Moreover, files can be easily truncated by zeroing data pointers:

int fd = open("new", O_CREAT | O_RDWR, 0666);   /* mode argument is required with O_CREAT */
ftruncate(fd, 1000000000);

This creates an inode describing a gigabyte-sized file whose data block pointers are all zero, i.e., one large hole. At a later time these blocks can be allocated by simply filling in the hole.

In the original Unix file system, the rename() system call did not exist due to the difficulty of implementing the feature. To perform an equivalent operation of renaming old to new, the following canonical series of actions would have been executed:

if (link("old", "new") == 0) (Creates two directory entries to the same file)
unlink("old");

Here, link() can be implemented with a simple write to the destination directory. Similarly, unlink() is performed by writing to the source directory. Neither the original inode nor its data is affected by the operation.

When an unlink is performed, the file system may not free the file's data blocks immediately. An additional field in the inode maintains a link count of how many directory entries reference that particular inode. Each new link increments the count; each unlink() decrements it. The space associated with the file is freed only on the two-part condition that the inode's link count is zero and the file is not open in any running process.
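A small illustrative sketch in C of this bookkeeping (the struct, helper, and field names are hypothetical):

struct inode {
    unsigned nlink;        /* number of directory entries referencing this inode */
    unsigned open_count;   /* hypothetical: number of processes holding the file open */
    /* ... block pointers, size, etc. ... */
};

static void free_file_blocks(struct inode *ip) { (void) ip; /* reclaim data blocks (omitted) */ }

/* Called after a directory entry referring to ip has been removed. */
void inode_unlinked(struct inode *ip) {
    if (--ip->nlink == 0 && ip->open_count == 0)
        free_file_blocks(ip);   /* free space only when no names and no open descriptors remain */
}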

Due to the possibility of directory corruption, for instance when power is cut unexpectedly, these link counts can be misrepresented when the file system is brought back up. In Unix/BSD, the program fsck is run right after rebooting to repair the file system and identify allocated blocks and inodes that no directory references. Instead of reclaiming the space immediately, fsck may place these orphaned files in a directory called lost+found.

We will discuss more on file system design and its implementation next time...