CS 111

Lecture 11 Scribe Notes: File System Design

Scribes: Zhen Zhou, Ding Zhao, and Haixiang Huang


Table of Contents

  1. Prefetching
    1. Cache Coherence Problem
    2. Problems of Dallying
  2. File System Design Request
  3. Dumbest Possible File System
  4. FAT File System
  5. The Unix file system
    1. Inode
    2. Block Bit Map
    3. Directory entries

1. Prefetching - When is it a bad idea? - File System Performance Continued

1.1 Cache Coherence Problem

  1. We assume LOCALITY OF REFERENCE, which is not always true. We may waste memory on data that will never be used.
  2. We assume CACHE COHERENCE, which is sometimes false. The cache may not faithfully reflect what is on the disk.
Example:

[Figure: two processes, each reading its own cached copy of the same file on disk]


Here, Process 2 thinks it is reading the file, but it is really reading its own cached copy.
However, the actual file content has changed after Process 1 wrote to the file.
FIX: Don't give each process its own cache. Make the cache belong to the file (instead of the process), and lock the cache while doing I/O.

1.2 Problems of Dallying

Example of Dallying:
write();
write();
write();
// what the process thinks: the data went to disk
// what actually happened: the data went to the cache
report "done" to the user;
// but what if the system crashes here?

Looking at the code above, several writes have been issued to the file, and the user has received the message "done".
However, what actually happened is that the process thought it was writing to disk while it was really writing to the cache.
What if the system crashes at this point, with all of the new contents still in the cache?

Solution: Write Barrier

The sync() function commits to disk all data in the cache.

sync(); //flushes the entire cache to disk

However, since it is so global, most applications would be slowed down just to protect one application.
Let's look at another approach, fsync(fd):

fsync(fd); //flush this file's blocks to disk

It flushes the data of the file referred to by the file descriptor fd to disk. The data here includes both the file's actual contents and its metadata. This approach can be slow because writing the metadata also takes a while.
The last one is:

fdatasync(fd); //only flush data, not meta data

This one is fast and useful, since users often do not care about timestamps, size, etc.


2. File System Design Request

Suppose that you are going to design a file system with the following requirements:

  1. Total size of the file system is 120 PB
  2. Build it with minimum cost


You want each disk to be as cheap as possible per byte, but also with large capacity, so you choose 600 GB drives.
A simple calculation gives that 200,000 disks are needed.

Then you run into problems:

  1. the various disks must be kept consistent with each other
  2. you have multiple CPUs (each with many disks)
  3. the CPUs need a high-throughput, low-latency interconnect
  4. you want as much parallelism as possible
  5. cache coherence issues - each CPU is caching its own copy of the data
  6. robustness - e.g., write barriers (fdatasync)


In the following sections, we are going to talk about several file systems that may become the solution.


3. Dumbest Possible File System

Now we try to design a simple file system.

Example:

512-byte sectors:

directory | file1 | file2 | file3 | <--------------------data-------------------->


Directory: an array of directory entries. Each directory entry is 256 bytes, and there are 200 entries in total.

Each directory entry has the following structure:

in use (0/1) | time | start | length | name


The 1972 RT-11 file system behaves just like the above (RT stands for "real time"). Because it reads files sequentially, predictability is ensured.

Problems:


I. Fragmentation
  A. internal (small problem). E.g., if a file has a size of 2^20 - 3 bytes, only 3 bytes of its last sector are wasted.
  B. external (big deal)

II. Preallocation required
  A. Shrinking a file is easy; growing one is dicey.
  B. The directory size is fixed (perhaps it is set at file-system creation time).
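A directory entry with those fields might be rendered in C as follows; the exact field widths are illustrative assumptions, not RT-11's real on-disk layout:

```c
#include <stdint.h>

/* A sketch of one 256-byte directory entry for the "dumbest possible"
   file system.  Field widths are assumptions for illustration only. */
struct dir_entry {
    uint32_t time;       /* timestamp */
    uint32_t start;      /* first sector of the file */
    uint32_t length;     /* file length in bytes */
    uint16_t in_use;     /* 0 = free, 1 = in use */
    char     name[242];  /* file name, padded so the entry is 256 bytes */
};
_Static_assert(sizeof(struct dir_entry) == 256, "entry must be 256 bytes");
```

With 200 such entries, the directory occupies 200 * 256 = 51,200 bytes, i.e. exactly 100 of the 512-byte sectors.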


4. FAT File System

FAT stands for File Allocation Table.
The goal of the FAT file system is to resolve two issues:
1. external fragmentation
2. preallocation

The layout of FAT File System:

boot sector | superblock | FAT | <--------------------data-------------------->


Superblock:
The beginning of the superblock contains the block number of the first block of the root directory.
The superblock is small, and its size is fixed.
The superblock also contains metadata about the file system, such as its size and the version of FAT.

FAT:
The FAT is an array of block numbers, with one entry per data block.
Each entry of the FAT can take one of these values:
0 : EOF (this is the last block of its file)
-1 : free block
N : block number of the next block in this file
Once we have the first block of a file, we can keep following next-block numbers until we reach the end of the file.
When trying to find an empty block for new contents, we simply search the FAT for an entry with the value -1.
Initially, data blocks are allocated sequentially (since all blocks start out empty).
Later on, however, when early blocks have been freed and we try to fit in a new file that is larger than the hole we freed, the file might be stored non-sequentially (depending on the implementation of the file system).

Directory:
In FAT, a directory is a file, but the content of this file is interpreted differently.
The content of the directory file is an array of directory entries.
Each entry points to a different file.

Problems with FAT file system:

  1. It assumes that the FAT area is cacheable (the whole FAT must fit in RAM for acceptable performance).
  2. Contiguous allocation is desirable, which means that
    - as the file system gets busy with creation and deletion, you get fragmentation (defragmentation can be a solution, but it is slow, we cannot tell the user that we need to defragment the disk, and it might never catch up with the rate of new creations and deletions)
  3. The time complexity of lseek is O(N), since seeking to an offset requires following the chain of blocks from the start of the file.
  4. rename within a directory is easy (just rewrite the name in the directory entry), but renaming a file into a different directory is hard to do safely.

5. The Unix file system

To get a first impression of what the Unix file system looks like, see the diagram below:

boot sector | superblock | inode table | <--------------------data-------------------->

The boot sector and the superblock have similar usage as FAT file system. The big improvement of the unix
file system is its use of inode structure.

5.1 Inode

An index node (inode) is a data structure that stores all the information about a regular file, directory, or other file-system object.
The following graph sketches its basic structure:

[Figure: an inode, with a file-info area followed by ten direct pointers, one single indirect pointer, and one doubly indirect pointer]

The info area stores the file's attributes, such as the size of the file, the type of the file, and the link count.
In a typical inode, besides the file attributes, there are:

  1. Ten direct pointers, which point directly to blocks containing the data of the file.
  2. One single indirect pointer. This pointer points to a block of pointers that point to blocks containing the data of the file.
  3. One doubly indirect pointer. This pointer points to a block of pointers that point to other blocks of pointers that point to blocks containing the data of the file.
Side problem:
Assume disk blocks are 8 KB and that disk addresses are 32 bits:
- What is the largest possible file using the above inode?
Since a disk block is 8 KB (8192 bytes) and disk addresses are 32 bits (4 bytes), each disk block can hold up to 2048 (8192 / 4) disk-address entries.
This means that the singly indirect block can reference up to 2048 * 8 KB of file data (16 MB). Extending this, we find that a doubly indirect
block can reference up to 2048 * 2048 * 8 KB of file data (32 GB).

To conclude:
Up to 80 KB of file data can be accessed using the direct blocks. This translates into file offsets 0 through (80 KB - 1) inclusive.
Up to 16 MB of file data can be accessed using the singly indirect block stored in the inode. This translates into file offsets 80 KB through (16 MB + 80 KB - 1) inclusive.
Up to 32 GB of file data can be accessed using the doubly indirect block stored in the inode. This translates into file offsets (16 MB + 80 KB) through (32 GB + 16 MB + 80 KB - 1) inclusive.
We could even add a triply indirect pointer, which extends the maximum file size to roughly 64 TB.

5.2 Block Bit Map

The Berkeley file system adds a block bitmap to the file system.

boot sector | superblock | block bitmap | inode table | <--------------------data-------------------->

The block bitmap is an array of bits; each bit indicates whether the corresponding data block is free.
This makes it cheap to find free blocks, and free blocks near a given block, which helps achieve contiguous allocation.

5.3 Directory entries

Now we have the idea of how files are stored in the Unix file system.
Each directory entry is represented as:

name (14 bytes) | inode number (2 bytes)

That is, a directory entry is really just a name and an inode number!
For example, a directory's contents can be represented as follows:

"foo.txt" 27
"gah.txt" 48
"hello.txt" 27