May 9, 2013
By Justin Morgan
When applications issue a high volume of I/O requests for control of a single readable or writable device, the operating system must decide how to queue these requests in order to minimize latency and promote fairness of selection.
A scheduling algorithm orders pending I/O requests in a wait queue according to properties of the tasks, such as their arrival time, the locations of their target sectors on the disk, and the latency incurred by moving the disk head to those sectors.
To illustrate scenarios where a scheduling algorithm might be useful, we will assume a simple disk model by constructing an array of N blocks labeled from 0 to N - 1, where one block corresponds to one sector. At a given point in time, the disk head h can point to any of these blocks in the array, and with a non-empty wait queue must move to the next scheduled job at block i. The cost of accessing this block is given as |h - i|. For simplicity's sake we normalize positions so that the maximum value of |h - i| is 1; that is, a full sweep across the disk costs 1. As more requests come in, the next queued index i can appear at an arbitrary location in the block array. There are a variety of ways in which task i can be prioritized in the queue.
In first-come, first-serve (FCFS) scheduling, the operating system prioritizes incoming I/O requests based on their arrival time. The next block position i chosen for reading or writing corresponds to the request that arrived earliest among the tasks waiting in the queue.
We would like to measure the average disk access cost of an incoming job i given this approach. Consider an instance where h is located at one end of the disk:
In the worst case, h will have to move N blocks to the opposite side to reach i, resulting in a total cost |h - i| = 1. In the best case, i will occur at the exact position of the head, where |h - i| = 0. On average, the cost of moving the head to the target block is approximately |h - i| = 0.5.
Now we place the head directly in the center of the array, at h = N/2:
Since the maximum distance the head will move to the next job is N/2 blocks, the average seek cost is around |h - i| = 0.25. Given any position for h, the average cost can be found by integrating |h - i| twice, with respect to both i and h:
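Assuming both the head position h and the target position i are uniformly distributed over the normalized interval [0, 1], the double integral works out as follows:

\[
\int_0^1 \int_0^1 |h - i| \, di \, dh
  = \int_0^1 \left( \frac{h^2}{2} + \frac{(1 - h)^2}{2} \right) dh
  = \frac{1}{6} + \frac{1}{6}
  = \frac{1}{3}
\]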
In terms of seek time, we would like to do better than an average cost of 1/3.
Using the shortest seek time first (SSTF) scheduling strategy, the operating system prioritizes the I/O tasks in the queue that are closest to the current head position as requests arrive. This method increases throughput and reduces average latency by allowing jobs that are close together, such as contiguous block reads, to be completed before requests that require moving to a block farther away on the disk.
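As a rough sketch, an SSTF scheduler might choose the next request from its pending queue like this (the queue layout and names are illustrative, not taken from any particular kernel):

#include <stdlib.h>

/* One pending I/O request: the sector it targets. */
struct request {
    long sector;
};

/* Shortest seek time first: scan the pending queue and return the index of
 * the request whose target sector is closest to the current head position.
 * Returns -1 if the queue is empty. */
int sstf_pick(const struct request *queue, int nrequests, long head)
{
    int best = -1;
    long best_dist = 0;

    for (int i = 0; i < nrequests; i++) {
        long dist = labs(queue[i].sector - head);
        if (best < 0 || dist < best_dist) {
            best = i;
            best_dist = dist;
        }
    }
    return best;
}

Because the scan always favors whichever request happens to be nearest, a request at the far end of the disk can keep losing to newly arriving nearby requests, which is the starvation problem described next.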
However, this introduces the problem of starvation, where requests associated with blocks far away from the disk head are forced to wait while jobs with shorter seek times are served first, as in the following example:
The elevator algorithm for disk scheduling follows the same principle as SSTF, but serves the requests nearest the head in its current direction of movement. This is analogous to an elevator, which stops to pick up passengers while moving between upper and lower floors in a continuous direction. This approach eliminates the starvation issue of SSTF since a request is guaranteed to be completed regardless of the arrival order of other tasks.
However, the elevator algorithm is not as fair as FCFS since tasks that come in behind the disk head must wait for the head to switch directions. Consider the following two scenarios:
The task labeled i in the first diagram will wait longer than task i in the second diagram due to the first task's location just behind the head at the far end of the disk. This mirrors the situation in which elevator patrons on the bottom and top levels of a building must wait longer for the elevator to reach their floor than patrons who request the elevator at floors in between.
There are two solutions that can alleviate this fairness cost. The first is to place blocks that are more frequently accessed in the center of the disk. The second (and generally preferred) approach is to utilize a circular elevator that proceeds in only one direction, such as from left to right, and jumps back to the left side once the right boundary has been reached. The circular elevator equalizes the expected seek cost across all possible locations of i, at the cost of rewinding the disk head once per complete pass of the disk.
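A minimal sketch of the circular elevator's selection rule, assuming pending requests are tracked in a simple per-sector bitmap (the representation is illustrative):

#include <stdbool.h>

/* Circular elevator selection over a per-sector bitmap of pending requests:
 * pending[s] is true if some request targets sector s.  The head only moves
 * toward higher sector numbers; when nothing is pending at or ahead of the
 * head, it wraps around to the lowest pending sector.
 * Returns -1 if no request is pending. */
long circular_pick(const bool *pending, long nsectors, long head)
{
    for (long s = head; s < nsectors; s++)   /* scan ahead in the one allowed direction */
        if (pending[s])
            return s;

    for (long s = 0; s < head; s++)          /* rewind and scan again from the left edge */
        if (pending[s])
            return s;

    return -1;
}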
We can apply the principle of dallying to disk scheduling by delaying the response to disk requests in anticipation that closer read or write jobs will arrive a short time later. The operating system can allocate a certain amount of time to wait, such as 2 ms, for an I/O request that is closer to the head position to arrive. Consider the example below, where successive read jobs are issued to the system one at a time:
Since only one request exists in the queue at a time, the disk head will move to block 10, then block 500, then 11, 501, 12, and so on. With anticipatory scheduling, if these requests arrive within suitable time windows, the scheduler will pause after servicing block 10 in anticipation that the contiguous blocks 11, 12, 13, and 14 will be requested shortly, reducing the overhead of migrating the disk head across large distances.
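A minimal sketch of such a policy, assuming the scheduler tracks how long it has already dallied and treats a small seek distance as "close enough" (the 2 ms window comes from the text above; the distance threshold and names are illustrative):

#include <stdbool.h>

#define ANTICIPATION_WINDOW_MS 2    /* how long the scheduler is willing to dally   */
#define NEARBY_THRESHOLD       64   /* seek distance (in blocks) considered "close" */

/* Decide whether to dispatch the best pending request now, or keep dallying
 * in the hope that a request closer to the head arrives soon. */
bool should_dispatch(long head, long best_pending_block, long ms_waited)
{
    long dist = best_pending_block > head ? best_pending_block - head
                                          : head - best_pending_block;

    if (dist <= NEARBY_THRESHOLD)
        return true;                             /* close enough: no reason to wait */

    return ms_waited >= ANTICIPATION_WINDOW_MS;  /* far away: dally, but not forever */
}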
Flash memory operates substantially faster than a hard disk since it contains no moving parts such as a read-and-write head, and therefore does not suffer from seek latency. However, flash memory tends to be more expensive, and flash storage blocks are susceptible to wearing out when many writes are performed at the same location in memory. Due to this risk, most flash controllers implement wear leveling, where the controller moves frequently modified blocks around in memory so that no single physical block wears out prematurely. As a result, the controller must keep track of these locations somewhere in flash memory, and this metadata must itself migrate often to reduce wear. These operations degrade the performance of flash memory under write-heavy workloads.
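As a toy illustration of the bookkeeping involved, a controller could keep a logical-to-physical remap table together with per-block erase counts (the structure, sizes, and names here are invented for illustration and do not describe any real controller):

#include <stdint.h>

#define NBLOCKS 1024   /* toy flash device size, in blocks (illustrative) */

/* Toy wear-leveling state: a logical-to-physical remap table plus a per-block
 * erase counter.  A real controller stores this mapping in flash itself, which
 * is why the metadata must also migrate to avoid wearing out its own block. */
struct wear_state {
    uint32_t remap[NBLOCKS];        /* logical block -> physical block                */
    uint32_t erase_count[NBLOCKS];  /* how many times each physical block was erased  */
};

/* On a rewrite of a logical block, redirect it to a lightly worn free physical
 * block instead of erasing the same physical location again. */
void remap_on_write(struct wear_state *ws, uint32_t logical, uint32_t fresh_physical)
{
    ws->remap[logical] = fresh_physical;
    ws->erase_count[fresh_physical]++;
}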
Moreover, to write to flash memory, blocks must be erased before the write is performed. Erasure happens in bulk, an entire block at a time, which in turn places a premium on sequential writing. This undermines the idea that writing can be performed via random access (like reading), since an entire block must be erased and rewritten each time.
A file system is a data structure stored on disk (or on an SSD) that provides an abstraction of how the operating system defines and manages a set of files. To illustrate general concepts and issues associated with how file systems can be organized, we will explore three major examples of file system design.
The “Paul Eggert File System” is a design created specifically for the development of a minicomputer, built around a simple intuition of how files should be stored on disk. The system can be visualized as a straight line of contiguous storage. Each file begins on a boundary of the disk’s 512-byte sectors and occupies a contiguous run of sectors; since a file’s data rarely fills its last sector exactly, unused space is left between the end of one file’s data and the start of the next.
The first 16 sectors of the disk are reserved to store a fixed-size directory, consisting of an array of directory entries. Each 20-byte entry consists of three fields: the first 16 null-padded bytes hold the name of the file, the next 16 bits are the starting sector index of the file, and the last 16 bits are the length of the file in bytes. An entry is marked as unused if its name field consists entirely of null bytes. Since each entry is of fixed size, a file can be located by simply scanning the directory table and comparing the name field of each entry.
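A minimal sketch of what one of these 20-byte entries might look like in C (the struct and field names are illustrative; the layout follows the description above):

#include <stdint.h>

/* One 20-byte directory entry as described above; an entry whose name field
 * is all null bytes is unused. */
struct pefs_dirent {
    char     name[16];      /* file name, padded with null bytes          */
    uint16_t start_sector;  /* index of the file's first 512-byte sector  */
    uint16_t length;        /* length of the file in bytes                */
};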
The greatest advantage of this design is simplicity. In addition, since data is stored sequentially on disk, contiguous reads are relatively fast.
However, several disadvantages are inherent in this design:
Internal fragmentation results from the restricted sizes of allocated blocks. For instance, a 532-byte file will occupy two sectors and waste 492 bytes of free space in its second sector following the file data. On average, the file system wastes 256 bytes of free space per file, since the last sector of each file is half empty on average.
External fragmentation happens when enough free space exists on the disk, but these blocks of free space do not occur contiguously and are not available all at once. As a result, allocation requests for large files may be rejected even though enough total space exists for them in the file system.
External fragmentation is one of the most damaging factors in file system performance. The FAT (File Allocation Table) file system was designed specifically to eliminate this type of fragmentation.
Each entry in the FAT holds one of the following values:

  0  = free block
 > 0 = index of the next block in the current file cluster
 -1  = EOF, or the last block in the current file cluster
There exists one FAT data entry for each block in data. In the FAT 16 architecture, block indices are 16 bits wide. With 4 KiB block sizes (2^12 bytes), the file system can reach a maximum size of 2^12 x 2^16 = 2^28 bytes = 0.25 GiB. In FAT 32, the maximum size can reach 2^12 x 2^32 = 2^44 bytes = 16 TiB.
Directories are stored in data as non-fixed-size tables to avoid imposing a limit on how many files can be stored. A directory is defined by an array of directory entries which consist of the following fields according to the FAT 16 specification:
Files are read by jumping between block entries in the FAT, starting with the block index defined in the third field of the file’s directory entry. For instance, in the example below, file A is stored in block entries 30, 99, and 54. The file ends at block index 54, which is indicated by an EOF marker.
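A minimal sketch of this traversal, assuming the FAT has been loaded into an in-memory array of signed entries that match the table above (the function name and printing are illustrative):

#include <stdint.h>
#include <stdio.h>

#define FAT_EOF (-1)   /* matches the table above: -1 marks the last block */

/* Walk a file's chain of blocks through an in-memory copy of the FAT,
 * starting from the first block index stored in the file's directory entry,
 * and print each data block index in order. */
void walk_fat_chain(const int32_t *fat, int32_t first_block)
{
    for (int32_t block = first_block; block != FAT_EOF; block = fat[block])
        printf("data block %d\n", block);
}

For the example above, fat[30] == 99, fat[99] == 54, and fat[54] == FAT_EOF, so walk_fat_chain(fat, 30) visits blocks 30, 99, and 54 in order.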
We might be concerned if the FAT’s jumping mechanism has an adverse effect on performance. To retrieve data from a file, the disk head would have to move from the FAT to a data block, then back to the FAT, then to the next data block, and so on. Ideally, we would like to cache the FAT into RAM such that the disk would only have to move from data block to data block to perform a read. In FAT 16, each entry in the FAT consists of two bytes per block. With 4,096-byte blocks, the FAT represents roughly 2 / 4096 = 1 / 2048 = 1 / 2^11 of the file system. The largest possible FAT 16 system consists of 2^28 bytes. As a result, we would only need to fit 2^28 / 2^11 = 2^17 bytes = 0.125 MiB into cache, which is entirely feasible.
Since the FAT file system divides disk space into blocks, it is allowed to place portions of files across noncontiguous regions depending on how much free space exists between blocks in the data field. This effectively eradicates the problem of external fragmentation. In turn, however, FAT sacrifices the performance benefit of contiguous reads by splitting files across the disk.
Three core issues exist with the FAT file system:
The Berkeley FFS (Fast File System) tackles two of the aforementioned issues with the FAT design. Several structures are borrowed from the previous example, with the addition of a free-space bitmap marking which blocks are free and which are in use (one bit per block), and a region to store inode tables:
Originally utilized in many Unix systems, an inode (index node) is a data structure on disk that describes a file independently of any directory listing. The file system keeps track of fixed-size inodes in a fixed region, each structured as follows:
To keep inodes at a fixed size, each inode contains ten direct pointers to data blocks on the disk, covering blocks 0 through 9 of the file. Blocks are 8 KiB (8192 bytes) each, and pointers in inodes are 32 bits wide. These 10 pointers alone thus allow storage for 10 x 8192 bytes of data.
An eleventh pointer contains the location of an indirect block in data with 8192 / 4 = 2048 additional pointers to data blocks, allowing 2048 x 8192 = 2^24 bytes of additional data storage.
A twelfth pointer in the inode points to a double indirection block containing pointers to up to 2048 more indirect blocks, adding up to 2048^2 x 8192 = 2^35 bytes of data. This constitutes a large amount of possible storage for a single file at the cost of some overhead, since the indirect blocks themselves can occupy 8192 x (2048 + 1) ≈ 2^24 bytes for each file. With indirect blocks, only 2^24 / 2^35 ≈ 1/2000th of the disk is wasted, less than a percent of lost disk utilization.
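A minimal sketch of how a file-relative block number might be resolved against this layout (the struct, function names, and the tiny in-memory stand-in for the disk are illustrative; a real file system would read the indirect blocks from disk):

#include <stdint.h>

#define NDIRECT         10     /* direct pointers held in each inode                  */
#define PTRS_PER_BLOCK  2048   /* 8192-byte block / 4-byte pointer                    */
#define TOY_DISK_BLOCKS 256    /* size of the in-memory stand-in disk (illustrative)  */

struct inode {
    uint32_t direct[NDIRECT];  /* blocks 0..9 of the file                             */
    uint32_t indirect;         /* block holding 2048 pointers to data blocks          */
    uint32_t double_indirect;  /* block holding 2048 pointers to indirect blocks      */
};

/* Tiny in-memory stand-in for the disk, used only so the sketch is
 * self-contained; each "block" is viewed as an array of 2048 pointers. */
static uint32_t toy_disk[TOY_DISK_BLOCKS][PTRS_PER_BLOCK];

static const uint32_t *read_block(uint32_t bno)
{
    return toy_disk[bno];
}

/* Translate a file-relative block number n into a disk block number.
 * A result of 0 is taken to mean a hole (no block allocated yet). */
uint32_t block_for(const struct inode *ip, uint64_t n)
{
    if (n < NDIRECT)                      /* blocks 0..9: direct pointers           */
        return ip->direct[n];

    n -= NDIRECT;
    if (n < PTRS_PER_BLOCK)               /* next 2048 blocks: one level of lookup  */
        return read_block(ip->indirect)[n];

    n -= PTRS_PER_BLOCK;                  /* remaining blocks: two levels of lookup */
    return read_block(read_block(ip->double_indirect)[n / PTRS_PER_BLOCK])
                     [n % PTRS_PER_BLOCK];
}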
Directory entries are simplified in the FFS model. Contained in a directory block, each directory entry lists only the name of the file and a pointer to its respective inode, which contains the rest of the information associated with the file, such as pointers to its data blocks on disk. This design choice tackles the second problem with FAT by adding a level of indirection: renaming or moving a file modifies only the old and new directory entries, rather than moving the inode itself.
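A simplified sketch of such an entry (real FFS entries are variable-sized and also record entry and name lengths; the fixed layout and names here are illustrative):

#include <stdint.h>

/* Simplified directory entry: just a name and the number of the inode that
 * holds everything else about the file. */
struct ffs_dirent_simple {
    uint32_t inode_number;  /* which inode describes this file */
    char     name[60];      /* null-terminated file name       */
};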
With the inode approach, holes can exist between data block pointers, representing space that has not been allocated for the file. Reading from a hole in a file returns zeroes; writing into a hole causes the missing blocks to be allocated.
Moreover, files can be easily truncated by zeroing data pointers, and a large file containing a hole can be created cheaply, as in the following example (the seek offset and mode argument shown here are illustrative):

fd = open("new", O_CREAT | O_RDWR, 0666);
lseek(fd, 1000000000, SEEK_SET);   /* jump roughly a gigabyte past the start of the file */
write(fd, "x", 1);                 /* allocate only the final data block */

This will result in the creation of an inode defining a gigabyte-size file with a large hole of data blocks defined by zeroes in the table. At a later time these blocks can be allocated by simply filling in the hole.
In the original Unix file system, the rename() system call did not exist due to the difficulty of implementing the feature. To perform an equivalent operation of renaming old to new, the following canonical series of actions would have been executed:
if (link("old", "new") == 0) (Creates two directory entries to the same file)
unlink("old");
Here, link() can be implemented by doing a simple write to the destination directory. Similarly, unlink is performed by writing to the source directory. Neither the original inode nor its respective data is affected by the operation.
When an unlink is performed, the file system may not free the file’s data blocks immediately. An additional field in the inode maintains a link count of how many directory entries refer to that particular inode. For each new link created, the link count is incremented; for each unlink(), the count is decremented. The space associated with the file is freed only on the two-part condition that the inode’s link count is zero and the file is not open in any current process.
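A minimal sketch of this bookkeeping, assuming an in-memory view of the inode with a link count and an open count (both the struct and field names are illustrative):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative in-memory view of an inode's bookkeeping fields. */
struct mem_inode {
    uint32_t link_count;  /* how many directory entries point at this inode   */
    uint32_t open_count;  /* how many processes currently have this file open */
};

/* Called when a directory entry referring to this inode is removed (unlink).
 * Returns true if the file's data blocks may now be reclaimed: no directory
 * entry names the file AND no process still has it open. */
bool drop_link(struct mem_inode *ip)
{
    if (ip->link_count > 0)
        ip->link_count--;

    return ip->link_count == 0 && ip->open_count == 0;
}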
Due to the possibility of directory corruption, for instance when power is cut unexpectedly, these link counts can be inaccurate when the file system is brought back up. In Unix/BSD, the fsck utility is run right after rebooting to repair the file system and to identify blocks and inodes that are allocated but no longer referenced by any file. Instead of reclaiming the space immediately, fsck may place these orphaned files in a directory called lost+found.
We will discuss more on file system design and its implementation next time...