Lecture 11
File System Performance:
This is a continuation of Lecture 10
Prefetching... When is it bad?
- Prefetching normally assumes locality of reference.
- Prefetching assumes that the data we are prefetching is correct. This is a problem of cache coherence.
Say two processes each hold a cache for a file that they are reading in.
These caches contain the prefetched contents of the file. Process 1 then writes to the file.
Now Process 2's prefetched data could be inconsistent with the file.
We can fix this issue by moving the cache into the kernel and giving each process access to that
single shared cache. Now whenever Process 1 or 2 writes, each will see the other's changed data.
Dallying Downsides
Say we issue the following.
write(...);
write(...);
write(...);
write(...);
write("done");
[crash]
These writes appear to go to disk, but in reality we have been writing to the cache.
If the machine crashes before we can flush those changes to disk, they will be lost.
So we introduce a write barrier.
A write barrier flushes all the cached writes to disk on command.
We have multiple possible function calls to accomplish this:
- sync();
This pushes every cache to disk, meaning the caches for every single file currently being cached.
This is undesirable: an individual process doesn't need to flush every cache,
just the ones it uses.
- fsync(fd);
This function does just that. It flushes all parts of the cache associated with fd to disk,
removing the penalty of dumping all files. This is still too slow for practical applications,
so there is yet another alternative.
- fdatasync(fd);
This function is faster because it doesn't attempt to flush a file's metadata at all. It only flushes
the data portion associated with fd. This means we don't have to write the disk in more than
one place: if we had to write metadata we would seek to the block bitmap and then seek to the data, but with
this function we only seek to the data.
File System Design
Story Time
IBM announced they are shipping a 120-petabyte file system to an unnamed customer.
The cheapest fast drives were 600 GB (assume 15K RPM drives), so we need roughly 200,000 drives.
IBM story
Problems
- The various disks must be kept consistent.
- We need to give each set of drives to a CPU, but disks don't require much CPU time.
This means we will have many disks connected to one CPU, yet we still need a lot of CPUs
to interact with the disks efficiently. (One CPU means one bus, which means all traffic flows over one wire,
which is highly inefficient.)
- We need an interconnect with high throughput and low latency (lots of bits/second).
- Having multiple processors causes cache coherency issues, since each processor
holds its own cache. So we need a new way to implement the sync functions specified earlier.
This all comes down to having as much parallelism as possible.
To take on this problem, let's look at creating a simple file system first.
Simple File System
Our simple file system is just an array of sectors.
To store a file, we put it in the first available free region of the disk.
We also have a region to hold file metadata: a directory, which is an array of 256-byte
directory entries containing info about where the files are.
Files are never broken up; each is stored contiguously.
Problems
- Fragmentation:
  - Internal fragmentation: the wasted bytes in a sector that a file doesn't use.
  - External fragmentation: caused by storing files contiguously. If a file sits in the middle of our file system
and we want to add a new file that's bigger than the free space on either side of it,
we can't add it, even though there is enough total free space; it's just split by the middle file.
- We need to know the size of a file upon creation, since we store files contiguously.
- We have an upper limit on the number of files, set by the size of our metadata section.
Advanced File System Technology
FAT File System
Goal
We want to eliminate the external fragmentation problem and the need to preallocate (at least for data).
Details
FAT stands for File Allocation Table. We have the following divisions:
Boot Sector | Super Block | FAT | ............ data ........... |
Boot Sector
We didn't cover this piece of the file system, but here is a link:
Boot sector explanation
Super Block
This contains metadata about the file system.
These include the size of the file system, the version number, and other assorted data.
The Super Block has a fixed size.
File Allocation Table
A FAT is an array of block numbers: for each 4 KiB block in our file system there is an entry in the FAT.
Each entry is either 0 (=> EOF), -1 (=> free block), or N (=> the block number of the next block in the file we are looking at).
This creates a linked list of storage, with the next pointers being the block numbers in the FAT.
This solves the problem of external fragmentation, since we can string non-contiguous blocks together to form a single file.
Data
This is just the actual file data.
Directory
A directory is a file whose contents are an array of directory entries.
A directory entry contains a name, a size, and the index of the first block in the file.
...Name... | Size | Block index in FAT |
We also need a root directory. This is simple: just have it be the first entry in the FAT.
Issues
- We assumed that the FAT is storable in RAM, but this isn't scalable: as the number of blocks increases,
so does the number of required FAT entries. So our FAT grows with the number of sectors.
- Contiguous allocation is desirable. This is clearly possible early on, but as we add and remove
files our disk gets fragmented and we can't get contiguous allocation as easily.
- lseek(fd, N, SEEK_SET); is now an O(N) operation, because we have to traverse the linked list looking
for the correct block, which requires the disk arm to move around the FAT if it isn't in RAM.
- Rename between directories doesn't work, since we would need to write in two different directories,
and a crash between the two writes leaves the file system inconsistent.
We can fix all this by introducing a new file system approach.
Inode Idea
Static Data Structures
As with the FAT file system we have a boot sector and a super block, but the difference is the
inode table.
The inode table contains all the space needed to describe every file that will ever exist; its size is a parameter
set upon file system creation.
Inode
An inode is an on-disk data structure, one per file in our file system.
Inodes are all preallocated, which means there is a maximum number of files our
file system can have, fixed upon creation.
An inode has the following structure:
- A file size.
- A link count.
- A type.
- Permissions.
- 10 direct blocks: block numbers telling which blocks contain file data.
- An indirect block: a block containing block numbers of data blocks.
- A doubly indirect block: a block containing block numbers of indirect blocks, each of which contains block numbers of data blocks.
Improvements
lseek(...) is now O(1), because we can compute which block holds a given offset directly from the inode's block pointers.
But we still need to fix the other three problems specified earlier.
As an aside, the Berkeley File System also includes a block bitmap, which records
whether each block is free. This helps encourage contiguous allocation.