
Strategies for Improving File System Performance

Batching

Batching is used for both reads and writes. Data from multiple jobs is accumulated in blocks until the read/write is actually performed, minimizing per-call overhead. Large block sizes give good throughput, since they minimize the number of times read/write is called, while small block sizes give good latency, since small jobs finish sooner.
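A minimal sketch of write batching in C, in the spirit of stdio's internal buffering (BLOCK_SIZE is an illustrative choice, and error handling is omitted):

    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192   /* illustrative block size */

    static char blockbuf[BLOCK_SIZE];
    static size_t used;

    /* Accumulate small writes in one block-sized buffer, issuing a single
       write() per full block instead of one system call per job. */
    void batched_write(int fd, const char *data, size_t len)
    {
        while (len > 0) {
            size_t n = BLOCK_SIZE - used;
            if (n > len)
                n = len;
            memcpy(blockbuf + used, data, n);
            used += n;
            data += n;
            len -= n;
            if (used == BLOCK_SIZE) {
                write(fd, blockbuf, BLOCK_SIZE);  /* error handling omitted */
                used = 0;
            }
        }
    }

With a large BLOCK_SIZE, many small jobs share one system call (good throughput); with a small one, each job reaches the disk sooner (good latency).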

Prefetching

Prefetching only works for reads, not writes. In prefetching, the OS guesses where the app will read next. While the app is busy, the OS reads in the next block so that it is ready when the app needs it. In this sense, we get "double buffering for free." However, the downside to this approach is that we may fetch an extra block we didn't need. This ties up the bus for data that never gets used.

The OS handles the guessing at runtime. There are multiple methods for guessing, including sequential guessing, backward sequential/previous guessing, and assuming random access (the "random access" guess). In sequential guessing, the OS has noticed that blocks are being read in increasing order, so it automatically reads in the next one. Similarly, in backward sequential/previous guessing, the OS notices that blocks are being read in backward order, so it automatically reads the previous block. If the OS assumes random access, nothing gets prefetched and prefetching is effectively turned off, since the OS cannot guess which block will be accessed next.
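An application can also override the guess explicitly: on POSIX systems, posix_fadvise lets it declare its access pattern up front. A short example:

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>

    /* Promise sequential reads so the kernel can prefetch aggressively;
       POSIX_FADV_RANDOM would instead disable readahead, matching the
       "random access guess" described above.  (0, 0) means the advice
       applies to the whole file. */
    int hint_sequential(int fd)
    {
        return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    }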

Prefetching is also part of a larger concept called "Speculation," which is described below.

Speculation

An OS that speculates makes guesses about the near future in order to improve performance. These guesses may result in more work for the OS now, but less work later. In other words, the next instruction may not perform as well as it would have if the OS had not been speculating, but overall performance improves.

Speculation assumes spatial locality and temporal locality. Spatial locality assumes that blocks near each other will need to be read. For example, if you read the ith block, the OS assumes that you'll probably want the i+1 and i-1 blocks as well. Temporal locality assumes that blocks which were just read will be needed at some time in the very near future. For instance, if you read the ith block at time t, the OS assumes you'll want the ith block at time t + d, where d is small.

Dallying

Dallying is the write-side counterpart to prefetching; it is only used for writes. When an application tells the OS to write something, the OS doesn't actually do it right away. The app calls something like write(fd, buf, 1024), which returns 1024, but the data only goes into main memory. Rather than writing to disk, the OS simply caches what needs to be written for now. A later write to the same location can then overwrite the cached copy, eliminating a disk write that would otherwise have been performed, or it can append to what is already in the cache, batching the work.
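A minimal sketch of the idea, assuming a single cached block (a real OS keeps many blocks and flushes a dirty block before reusing it for a different one):

    #include <string.h>

    #define BLOCK_SIZE 8192

    /* A one-block write-back cache: writes land here, not on disk. */
    struct dally_cache {
        long block_no;          /* which disk block this holds; -1 if none */
        char data[BLOCK_SIZE];
        int  dirty;             /* 1 means the disk copy is stale */
    };

    /* Absorb a write without touching the disk.  A later write to the
       same bytes overwrites the cached copy, so the earlier disk write
       is never performed; a later write past them appends, batching
       the eventual disk work into one transfer. */
    void dallying_write(struct dally_cache *c, long block_no,
                        size_t off, const char *buf, size_t len)
    {
        c->block_no = block_no;   /* assumes cache is free or already holds this block */
        memcpy(c->data + off, buf, len);
        c->dirty = 1;             /* actual disk write deferred until sync/eviction */
    }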

                  CPU Cache (D4)
                        |
          +-------------+-------------+
          |                           |
      RAM (D3)       Disk Controller Cache (D2)
                              |
                          Disk (D1)

A downside to dallying is cache coherence. The CPU's cache, RAM, the disk controller's cache, and the disk can all hold copies of the same data: D1 on the disk, D2 in the disk controller's cache, D3 in RAM, and D4 in the CPU's cache, as shown above. When all copies D1-D4 are the same, there is no issue. If they differ, this can cause problems: if power is lost, only the D1 copy on disk is preserved.

Inconsistencies between copies can be avoided by using sync(), fsync(fd), and fdatasync(fd). sync() moves RAM data to the disk by scheduling all cached blocks for writing to the disk (i.e. "stop dallying"). However, sync() works across all processes, whereas we may only want to write the data associated with a particular file, and writing all of RAM to disk takes a long time. Furthermore, sync() only schedules the work, so it returns before all the data is actually written, meaning the user does not know for sure that the data is safely on disk.

An alternative to sync() is fsync(fd), which syncs only the parts of RAM associated with the file descriptor fd. It ensures the file descriptor's blocks are on the disk (usually a long wait, ranging from 10 to over 100 ms) before it returns. Though fsync() is an improvement on sync(), it still has issues: each update writes more than one block, since both the data and the metadata (e.g. the "modified time") must be updated. fdatasync(fd) avoids this by syncing only the data, not the metadata, making it the cheapest choice when the metadata does not matter. This is a good fit for scenarios such as keeping track of bank accounts, where we care about the balance and not the transaction time. However, not updating the metadata can lead to problems such as $ ls -l showing the wrong information after a crash.
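A short example of the trade-off, using the sync-family calls described above (durable_write is a hypothetical wrapper; error handling is abbreviated):

    #include <unistd.h>

    /* Durably record a write, assuming the metadata (e.g. mtime) is not
       worth a second disk write -- the bank-balance case above. */
    int durable_write(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t)len)
            return -1;
        /* fdatasync waits until the file's data blocks are on disk but
           skips pure metadata updates; use fsync(fd) if the metadata
           matters too, or sync() to schedule every dirty block
           system-wide without waiting. */
        return fdatasync(fd);
    }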


File Systems

A file system is a logical structure used for managing files. File systems are hybrid data structures: the principal part lives on disk, and the rest, which is needed for performance, lives in main memory.

AVSFS (c. 1974)

File System Setup:

| Filenames Array | file0 data |////////| file1 data |///////| file2 data |///|


Filenames Array Element:

| Filename | Offset | File Size | Allocated Size |


AVSFS was created by Professor Eggert in 1974. It has an array of filenames and locations at the beginning of the hard drive; the location indicates where on disk each file starts. The rest of the drive consists of contiguous runs of data, each of which is a file.
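One way the filenames-array element above might be declared in C (the field names and widths are illustrative guesses, not part of the original design):

    #include <stdint.h>

    /* One entry of the filenames array at the start of the disk. */
    struct avsfs_dirent {
        char     name[16];    /* filename */
        uint32_t offset;      /* where on disk the file's data starts */
        uint32_t size;        /* bytes actually in use */
        uint32_t allocated;   /* upper bound, fixed when the file is created */
    };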

This file system has some advantages: it is very simple, and it can access data sequentially very quickly. However, there are downsides. Because every file must be contiguous, there is external fragmentation. In other words, there might be enough total free space on disk to store a file, but because the space is not contiguous, the file cannot be stored. Also, because the array of filenames and addresses is created when the file system is first set up, there is a fixed number of files that can be created: even if there is enough space on disk for a new file, there may be no room for its entry in the directory array. This array also needs to be small enough to fit in RAM, meaning the system can only support a small number of files. The system is also non-hierarchical: there are no directories or subdirectories, just one root directory. Finally, because all of a file's data must be contiguous, the user has to specify an upper bound on its size when the file is first created. The file size cannot grow dynamically, so the user must be able to estimate the size in advance.

This file system is similar to RT-11 (1971), which is a "real time" file system built for reliability/predictability over performance.

FAT File System (1970s)

File system setup:

| Boot Sector | Superblock | FAT | Data Blocks: 1 | 2 | 3 | 4 | ... |


Directory:

| Filename | 1st Block # | Size (bytes) | File Type |
| ...      | ...         | ...          | ...       |
| ...      | ...         | ...          | ...       |


In the FAT file system, the boot sector holds the information needed to start the system. (Professor Eggert talked about the boot sector during the first week, when discussing "The Paranoid Professor.") The superblock contains information about the file system, such as its size, statistics, block size, the block number of the root directory, and other helpful information. The data is broken into 4KB blocks, with 8 sectors per block. The locations of all the blocks containing a file are stored in the FAT (file allocation table), an array of 16-bit values with one entry per data block. Each entry stores the number of the next data block to read. So, if a.txt is split across block 7 and then block 12, the 7th entry in the FAT is 12, indicating that the file continues in block 12. If a block is the final block of a file, its FAT entry is 0, indicating EOF. If a block is unused, all bits in its FAT entry are 1's, which is the two's complement representation of -1.
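As a sketch, reading a file is just a walk down this linked list of FAT entries (read_block is a hypothetical helper, and the 0/-1 conventions follow these notes):

    #include <stdint.h>

    #define FAT_EOF  0                 /* per these notes: 0 marks the last block */
    #define FAT_FREE ((uint16_t)-1)    /* all ones: the block is unused */

    extern void read_block(uint16_t block_no);  /* hypothetical helper */

    /* Visit a file's data blocks in order by walking its FAT chain.
       If a.txt starts at block 7 and fat[7] == 12, this visits 7,
       then 12, and stops at an entry of 0 (EOF). */
    void read_file(const uint16_t *fat, uint16_t first_block)
    {
        for (uint16_t b = first_block; b != FAT_EOF; b = fat[b])
            read_block(b);
    }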

This system has some advantages. We do not have to state the size of a file when it is created, and the file can grow as needed. There is also no external fragmentation. However, seeking within a file is slow: to reach a given position, we must traverse the FAT's linked list through potentially every block of the file. Files also become fragmented. For example, the first three-quarters of a file might be stored in block 72 and the final quarter in block 6: the data is not contiguous, so reads require extra seeks, and there is internal fragmentation because the file's last block (block 6 here) is mostly empty, wasted space. Because of these disadvantages, a defragmenter is highly recommended when using this file system.

To create a directory, FAT creates an ordinary file that is treated specially when read. This special file contains a table with, for each file in the directory, its name, the block number of the block where the file starts, its size, and its type (directory or regular file).
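A possible C declaration of one such directory-table row (again, the field widths are guesses, not the real on-disk format):

    #include <stdint.h>

    /* One row of the FAT directory table above. */
    struct fat_dirent {
        char     name[11];      /* filename */
        uint16_t first_block;   /* start of this file's FAT chain */
        uint32_t size;          /* size in bytes */
        uint8_t  type;          /* directory or regular file */
    };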

Unix File System (c. 1973)

File system setup:

| Boot Sector | Superblock | Inode Table | Data Blocks: 1 | 2 | 3 | 4 | ... |


Inode Table:

| Owner | Group | Permissions | Time Stamp | Size | Link Count | Block 1 | Block 2 | ... | Block N |
| ...   | ...   | ...         | ...        | ...  | ...        | ...     | ...     | ... | ...     |
| ...   | ...   | ...         | ...        | ...  | ...        | ...     | ...     | ... | ...     |


Directory:

| Filename | Inode # |
| ...      | ...     |
| ...      | ...     |


In the Unix file system, the inode table is indexed by inode number and contains one entry per file. Each entry holds the metadata for the file, including owner, group, permissions, time stamp, size, link count, and block numbers. A fixed number of block numbers can be stored in one inode entry, so what do you do if you want a large file? The final block number points to a block of block numbers (called single indirection), which greatly expands the potential size of a file. If you still need more space, one more block number entry in the inode can be doubly indirect, meaning it points to a block of pointers to blocks which contain block numbers.

If a block number is 0, the block is empty, and reading from it returns all zeros. This allows a file to be stored compactly if it is mostly zeros with small "islands" of nonzero data.

To implement directories, there is a file which is a table of 14-byte names and 2-byte inode numbers (as shown above). Because each entry is 16 bytes in total, files are easy to index. This also allows the file system to be hierarchical.
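A sketch of how a byte offset might be mapped to a block number under this scheme (NBLOCKS, BLOCK_SIZE, and read_block are illustrative assumptions; double indirection is omitted for brevity):

    #include <stdint.h>

    #define NBLOCKS    10     /* block numbers per inode (illustrative) */
    #define BLOCK_SIZE 4096

    struct inode {
        /* owner, group, permissions, time stamp, size, link count, ... */
        uint32_t block[NBLOCKS];   /* last entry serves as the indirect block */
    };

    extern void *read_block(uint32_t block_no);  /* hypothetical helper */

    /* Map a byte offset within a file to its data-block number.  The
       first NBLOCKS-1 entries are direct; the last points to a block
       full of block numbers (single indirection).  A result of 0 is a
       hole: the reader supplies all-zero bytes. */
    uint32_t offset_to_block(const struct inode *ino, uint64_t offset)
    {
        uint64_t i = offset / BLOCK_SIZE;
        if (i < NBLOCKS - 1)
            return ino->block[i];                    /* direct block */
        uint32_t *indirect = read_block(ino->block[NBLOCKS - 1]);
        return indirect[i - (NBLOCKS - 1)];          /* via indirect block */
    }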

The advantage of such a file system is that it scales easily. A system with this setup could be 16 MB or 1 TB, as long as the block and inode numbers are wide enough to support that size. Seeks within a file are also very fast, since any block number can be looked up directly. However, there are downsides. Because the inode table is a fixed size, the number of files is fixed as soon as the file system is created. There is also fragmentation: a file's blocks can end up in chunks spread across the disk, which makes sequential reads slow.

A hard link is when two filenames are associated with the same inode number. This means that if we delete one filename, the inode it points to must not be deleted, because another filename still needs the data stored there. How do we keep track of how many filenames point to an inode? The inode keeps a link count of the number of directory entries pointing to it.
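A small illustration using the real link() and stat() calls (the filenames are hypothetical, and error checks are omitted): after link(), both names share one inode, and st_nlink reflects the link count described above.

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        struct stat st;
        link("a.txt", "b.txt");      /* second name for the same inode */
        stat("a.txt", &st);          /* stat("b.txt", ...) gives the same inode */
        printf("inode %lu has %lu links\n",
               (unsigned long)st.st_ino, (unsigned long)st.st_nlink);
        return 0;
    }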


Fun Fact: Professor Eggert used to be a psychology major!