Lecture 11
File System Performance:
This is a continuation of Lecture 10
Prefetching... When is it bad?
- Prefetching normally assumes locality of reference.
- Prefetching assumes that the data we are prefetching is correct. This is a problem of cache coherence.
Say two processes each hold a cache for a file that they are reading in.
These caches contain the prefetched contents of the file. Process 1 then writes to the file.
Now Process 2's prefetched data could be inconsistent with the file.
We can fix this issue by moving the cache into the kernel and giving each process access to that
single shared cache. Now whenever Process 1 or 2 writes, each will see the other's changed data.
Dallying Downsides
Say we issue the following.
write(...);
write(...);
write(...);
write(...);
write("done");
[crash]
These writes appear to go to disk, but in reality we have been writing to the cache.
If the machine crashes before we can flush those changes to disk, they will be lost.
So we introduce a write barrier.
A write barrier flushes all the cached writes to disk on command.
We have multiple possible function calls to accomplish this:
- sync();
This pushes every cache to disk, meaning the caches for every single file currently being cached.
This is undesirable: an individual process doesn't need to flush every cache,
just the ones it uses.
- fsync(fd);
This function does just that. It flushes all parts of the cache associated with fd to disk,
removing the penalty of dumping all files. This is still too slow for practical applications,
so there is yet another alternative.
- fdatasync(fd);
This function is faster because it doesn't attempt to flush a file's metadata at all. It only flushes
the data portion associated with fd. This means we don't have to write the disk in more than
one place: if we had to write metadata we would seek to the block bitmap and then seek to the data, but with
this function we only seek to the data.
File System Design
Story Time
IBM announced they are shipping a 120-petabyte file system to an unnamed customer.
The cheapest fast drives were 600 GB (assume 15K RPM drives), so we need roughly 200,000 drives.
IBM story
Problems
- The various disks must be kept consistent.
- We need to give each set of drives to a CPU, but disks don't require much CPU time.
This means we will have many disks connected to one CPU, yet we still need a lot of CPUs
to interact with the disks efficiently. (One CPU means one bus, which means all traffic flows over one wire,
which is highly inefficient.)
- We need an interconnect with high throughput and low latency (lots of bits/second).
- Having multiple processors causes cache coherency issues, since each processor
holds its own cache. So we need a new way to implement the sync functions specified earlier.
This all comes down to having as much parallelism as possible.
To take on this problem, let's look at creating a simple file system first.
Simple File System
Our simple file system is just an array of sectors.
To store a file, we put it in the first available free region of the disk.
We also have a region to hold file metadata: a directory, which is an array of 256-byte
directory entries containing info about where the files are.
Files are never broken up; each is stored contiguously.
Problems
- Fragmentation:
  - Internal fragmentation: the wasted bytes in a sector that a file doesn't use.
  - External fragmentation: caused by storing files contiguously. If a file sits in the middle of our file system
and we want to add a new file that's bigger than the free space on either side of it,
we can't add it, even though there is enough total free space; it's just split by the middle file.
- We need to know the size of a file upon creation, since we store files contiguously.
- We have an upper limit on the number of files, set by the size of our metadata section.
Advanced File System Technology
FAT File System
Goal
We want to eliminate the external fragmentation problem and the need to preallocate (at least for data).
Details
FAT stands for File Allocation Table. We have the following divisions:
Boot Sector | Super Block | FAT | ............ data ........... |
Boot Sector
We didn't cover this piece of the file system, but here is a link:
Boot sector explanation
Super Block
This contains metadata about the file system.
These include the size of the file system, the version number, and other assorted data.
The Super Block has a fixed size.
File Allocation Table
A FAT is an array of block numbers: for each 4 KiB block in our file system there is an entry in the FAT.
Each entry is either 0 (=> EOF), -1 (=> free block), or N (=> the block number of the next block in the file we are looking at).
This creates a linked list of storage, with the next pointers being the block numbers in the FAT.
This solves the problem of external fragmentation, since we can string non-contiguous blocks together to form a single file.
Data
This is just the actual file data.
Directory
A directory is a file whose contents are an array of directory entries.
A directory entry contains a name, a size, and the index of the first block in the file.
...Name... | Size | Block index in FAT |
We also need a root directory. This is simple: just have it be the first entry in the FAT.
Issues
- We assumed that the FAT is storable in RAM, but this isn't scalable: as the number of blocks increases,
so does the number of required FAT entries. So our FAT grows with the number of sectors.
- Contiguous allocation is desirable. This is clearly possible early on, but as we add and remove
files our disk gets fragmented and we can't get contiguous allocation as easily.
- lseek(fd, N, SEEK_SET); is now an O(N) operation, because we have to traverse the linked list looking
for the correct block, which requires the disk arm to move around the FAT if it isn't in RAM.
- Rename between directories doesn't work, since we would need to write in two different directories,
and a crash between the two writes leaves the file system inconsistent.
We can fix all this by introducing a new file system approach.
Inode Idea
Static Data Structures
As with the FAT file system we have a boot sector and a super block, but the difference is the
inode table.
The inode table contains all the space needed to describe every file that will ever exist; its size is a parameter
set upon file system creation.
Inode
An inode is an on-disk data structure, one per file in our file system.
Inodes are all preallocated, which means there is a maximum number of files our
file system can have, fixed upon creation.
An inode has the following structure:
- A file size.
- A link count.
- A type.
- Permissions.
- 10 direct blocks: block numbers telling which blocks contain file data.
- An indirect block: a block containing block numbers of data blocks.
- A doubly indirect block: a block containing block numbers of indirect blocks, each of which contains block numbers of data blocks.
Improvements
lseek(...) is now O(1), because we can compute which block holds a given offset directly from the inode's block pointers.
But we still need to fix the other three problems specified earlier.
As an aside, the Berkeley File System also includes a block bitmap, which records
whether each block is free. This helps encourage contiguous allocation.