CS111 - Project 3A: File System Dump
Project 3 is expected to involve much more programming than any of the other projects. You will design and implement programs to analyze file systems and diagnose corruption. In part A, you will write a program to read the image of a file system, analyze it, and summarize its contents in several csv files. These csv files will be used in part B, where we will analyze the consistency of file system data structures and diagnose problems.
Part A can be broken into two major steps:
RELATION TO READING AND LECTURES:
This project more deeply explores the file and directory concepts described in Arpaci chapter 39.
This project is based on the same EXT2 file system that is discussed in sections 40.2-40.5. Part B of this project goes much deeper than the introductory discussion of integrity presented in sections 42.1-2.
A single tarball (.tar.gz) containing:
Historically, file systems were almost always been implemented as part of the operating system, running in kernel mode. Kernel code is expensive to develop, difficult to test, and prone to catastrophic failures. Within the past 15 years or so, new developments have made it possible to implement file systems in user mode, improving maintainability … and in some cases delivering even better performance than could be achieved with kernel code. All of this project will be done as user-mode software.
To ensure data security and integrity, disks are generally protected from access by ordinary applications. Linux supports the creation, mounting, checking, and debugging of file systems stored in ordinary files. In this project, we will provide EXT2 file system images in ordinary files. Because they are in ordinary files (rather than protected disks) you can access/operate on those file system images with ordinary user mode code.
To complete this assignment, you will need to learn a few things:
Your program will be tested on multiple (broken) file systems. Here we provide one as an example. SEASnet servers do not support commands like sudo(8), mount(8), or debugfs(8), so to play with our provided file system image, you will have to install a Linux distribution (if you do not have one) on your own computer. For simplicity and convenience, you may choose to install Linux inside a virtual machine, for example VirtualBox (it’s free!).
Then download this image file, and mount it onto your own Linux, with the following commands:
mkdir fs
sudo mount -o loop disk-image fs
sudo chown -R $USER fs … where $USER is your user-name
Note: this file is 200 Mbytes in size.
Now, you can navigate the file system, just like an ordinary directory, with commands like ls(1), cat(1), and cd(1). After you are done with it, you can unmount with the following command:
sudo umount fs
Before you start writing your C program to interpret the “disk-image” file, you can explore it further using debugfs(8) (in your own Linux). You may find many useful commands in its man page. Some particularly helpful ones are: stats, stat, bd, testi, and testb. If you have problems interpreting parts of the file system, you can use the debugfs program to help you understand the correct contents.
To ensure you are correctly interpreting the file system image, we have included many unusual things, which would not likely be caught by an incorrect implementation:
In this step, you will write a program called lab3a that:
Please note that, although you cannot mount the provided image file on departmental servers, your lab3a program should be able to run on departmental servers, just like all previous assignments.
There are to be six csv-format files, each summarizing a different part of the file system. Remember, you can always check your program’s output against debugfs’s output. All the information required for the summary can be manually found and checked by using debugfs.
The file system images your program examines will, in many cases be damaged (as might result from missing or mis-directed writes). This means that you will have to do some Defensive Programming in your analysis program: validating parameters and pointers before you use them. In addition to correctly analyzing the file system and group parameters, inodes, block pointers, and directory entries, your program will also be expected to report (to stderr) any invalid parameters or addresses.
A basic set of file system parameters.
fields(9) | format |
magic number | hex |
total number of inodes | dec |
total number of blocks | dec |
block size | dec |
fragment size | dec |
blocks per group | dec |
inodes per group | dec |
fragments per group | dec |
first data block | dec |
Sample correct csv output
Ef53,51200,204800,1024,1024,8192,2048,8192,1
Sanity checking
Sample error messages (to stderr):
Superblock - invalid magic: 0xdead
Superblock - invalid block size: 666
Superblock - invalid block count 200000 > image size 50000
Superblock - invalid first block 100000 > image size 50000
Superblock - 20000 blocks, 1050 blocks/group
Superblock - 1000 Inodes, 66 Inodes/group
It is important to validate superblock parameters, because they are used to determine the addresses of inodes and to validate block pointers. If a file system fails any of these tests, log an error and exit without any further processing.
Information about each cylinder group
fields(7) | format |
number of contained blocks | dec |
number of free blocks | dec |
number of free inodes | dec |
number of directories | dec |
(free) inode bitmap block | hex |
(free) block bitmap block | hex |
inode table (start) block | hex |
Sample output (just for the first cylinder group)
8192,7913,2024,2,4,3,5
Sanity checking
Sample error messages (to stderr):
Group 7: 100000 blocks, superblock says 50000
Group 7: blocks 100000-150000, free Inode map starts at 165000
Group 7: blocks 100000-150000, free block map starts at 165000
Group 7: blocks 100000-150000, Inode table starts at 165000
Group descriptor validation is also critical because this table tells us where the inodes and free bit-maps begin. If a group descriptor fails any of these tests, log an error, and do not attempt to dump out bit-maps or inodes that are not within the group.
A list of free inodes and free blocks.
One line of output for each free inode or block in the bitmap. No output should be produced for allocated inodes or blocks.
fields(2) | format |
block number of the map | hex |
Free block/inode number | dec |
Sample output (just first five entries)
3,275
3,276
3,277
3,278
3,279
Sanity checking
Key information for each allocated inode.
An allocated inode has a file type.
fields(11+15) | format |
inode number | dec |
file type | char[1] |
mode | oct |
owner | dec |
group | dec |
link count | dec |
creation time | hex |
modification time | hex |
access time | hex |
file size | dec |
number of blocks | dec |
block pointers * 15 | hex |
Sample output (first four inodes only)
1,?,0,0,0,0,58201197,58201197,58201197,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,d,40755,1000,0,5,58201213,58201213,582011bd,3072,3,105,a602,a107,0,0,0,0,0,0,0,0,0,0,0,0
3,?,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,?,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Sanity checking
It is important to validate all block pointers before we use them:
Sample error message (to stderr):
Inode 407 - invalid block pointer[9]: 3ab403
If any block address falls outside the reasonable range, include it in the output, log an error message. But if an out-of-range block is an indirect block or contains directory entries, do not attempt to further interpret its contents.
The valid/allocated entries in each directory.
An allocated directory entry can be recognized by a non-zero inode number. Any directory entry that has a zero inode number should be assumed to be empty and not included in the output.
The term parent inode number, below, refers to the inode number of the directory that contains all of the entries that are being listed.
The term entry number, below, is a number that starts at 0 for the first directory entry, and is incremented for each entry that is encountered (whether allocated or not). If the first ten entries are allocated, the next is free, and the last is allocated, the output should describe entries: 0-9, and 11, with no output for entry 10.
fields(6) | format |
parent inode number | dec |
entry number | dec |
entry length | dec |
name length | dec |
inode number of the file entry | dec |
name | string[2] |
Sample output (first three directories)
2,0,12,1,2,”.”
2,1,1012,2,2,”..”
2,2,20,10,11,”lost+found”
Sanity checking
If a group descriptor fails any of these tests, log an error. If either the name or entry length is unreasonable, stop interpreting this directory:
Sample error message (to stderr):
Inode 204, block 3c05 - bad dirent: Inode = 107538
Inode 204, block 3c05 - bad dirent: len = 3040, namelen = 5240
These are all the non-zero block pointers in an indirect block. The blocks that contain indirect block pointers are included.
fields(3) | format |
block number of the containing block | hex |
entry number within that block | dec |
block pointer value | hex |
Sample output (first three entries)
1e10f,0,1e110
1e10f,1,1e111
1e10f,2,1e112
Sanity checking
It is important to validate all block numbers in indirect blocks for the same reason we had to validate block numbers in inodes: because we will follow those pointers to further interpret the file system.
Sample error message to stderr:
Indirect block 3c4 - invalid entry[103] = 304c8df
If any block number is out of range log an error message. If a out-of-range block is an indirect block, do not attempt to dump its contents.
For each different kind of entry, output its summary in a separate file. The names of your files should be:
In these csv files, each line represents one entry, and should be ended with a single new-line character ( ‘\n’, 0x0a). The field orders in those csv files should be the same as listed above, and your program should use and only use a comma to separate each field.
For your convenience, here are the exact six csv files generated by our solution: super.csv, group.csv, bitmap.csv,inode.csv, directory.csv, and indirect.csv.
Please carefully check the Rubric section (especially the Code review subsection) and all (latest) footnotes before submitting your code.
Your tarball should have a name of the form lab3a-studentID.tar.gz and should be submitted via CCLE.
We will test your work on a SEASnet GNU/Linux server running RHEL 7 (this is on lnxsrv09). You would be well advised to test your submission on that platform before submitting it.
Value Feature
Packaging and build (10%)
3% untars expected contents
3% clean build w/default action (no warnings)
2% Makefile has clean and dist targets
2% reasonableness of README contents
Code review (15%)
You can check header files like ext2_fs.h to have an idea about the variable types. HOWEVER, you shall write up your own code for the calculations like block/inode (number), indirect block pointer, various different kinds of offset, and so on. So in sum, do all the calculations with your own code, do not use macros/functions provided in header files like ext2fs.h.
5% overall readability and reasonableness, correct program name and csv filenames
10% image file investigation correct (use pread() and write your own code for calculations)
Results (75%)
⅔ points shall go to outputting the required fields correctly, and ⅓ points go to correct lines (penalty will be given for extra/missing lines).
10% super.csv
10% group.csv
10% bitmap.csv
15% inode.csv
15% directory.csv
15% indirect.csv
[1] We only recognize a (small) subset of the file types: ‘f’ for regular file, ‘d’ for directory, and ‘s’ for symbolic links. For other types, use ‘?’.
[2] Double-quoted. Don’t worry about how to print out non-graphical characters or embedded quotes.