CS111 - Lecture #12 - File System Implementation Issues

May 12, 2014 (Spring 2014) - Prashanth Swaminathan

Topics covered in this lecture:

  1. Levels of a File System - Exploring By Application
  2. File System Levels - Overview
  3. Determining Free Space In File Systems
  4. Inode Structure
  5. Arguments Against Hard Links
  6. Directory Entry Structure (in ext2)

File System Levels - Exploring by Application

Recall the paranoid professor introduced a few lectures ago? His newest problem is to remove data from a file system so that no one can recover and read it. We will explore each possible solution and note its problems:

$rm file
While this might delete the file eventually, what if another process is already holding a file descriptor for it? On FAT systems and on systems that use free block bitmaps, a file is removed by marking all of the blocks it was using as 'free' (in the file allocation table or the free block bitmap, respectively). (This is what was done in Lab 3 in the free_block function.) But we have a security flaw here: when the file is removed, we modify only the bitmap and not the data blocks themselves! This means that even after we remove the file, a process with an open file descriptor can still access the blocks and read whatever content remains. The only way to keep such a process from hanging onto the file is to reboot immediately, and even that assumes the process has not already written a copy of the file's data somewhere. In any case, this is too drastic to do every time a file needs to be removed.
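
To make the flaw concrete, here is a minimal sketch of what a bitmap-based free_block routine might look like. This is a hypothetical illustration (the struct and constant names are made up), not the actual Lab 3 code; note that it never touches the data block itself.

/* Hypothetical sketch: freeing a block only clears a bit in the free block
 * bitmap; the data stored in the block is never touched. */
#include <stdint.h>

#define BLOCK_SIZE 8192

struct filesystem {
    uint8_t *free_block_bitmap;   /* 1 bit per block: 1 = in use, 0 = free */
    uint8_t *data;                /* the data blocks themselves            */
};

static void free_block(struct filesystem *fs, uint32_t blockno)
{
    /* Clear the block's bit in the bitmap... */
    fs->free_block_bitmap[blockno / 8] &= ~(1u << (blockno % 8));
    /* ...and that is all.  Nothing like
     *   memset(fs->data + (size_t)blockno * BLOCK_SIZE, 0, BLOCK_SIZE);
     * ever runs, so the old bytes remain on disk. */
}
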
$shred file
The UNIX/Linux solution to this problem is the shred command. Shred overwrites the contents of the file with random bytes. After we delete the file, even if a process is holding onto a file descriptor for that file, it will only be able to access garbage data.

Aside #1: Why does shred overwrite with random bytes, instead of zero bytes, which is much faster?
Some storage media can leave traces of information about the original contents of a sector. When a disk head writes to a sector it first attempts to center on the track, but can be slightly off. Peter Gutmann, who presented a paper entitled "Secure Deletion of Data from Magnetic and Solid-State Memory" at the 1996 USENIX Security Symposium, explained that the "actual effect is closer to obtaining a 0.95 when a zero is overwritten with a one, and a 1.05 when a one is overwritten with a one". Given some information about which bits were overwritten, and a sensitive enough disk head, we could theoretically reconstruct what data existed in the sector.

  • If the data were overwritten with zeros alone, that would be a perfect signpost for a forensic scientist, who could then examine that area of the track. By looking at the bits that show signs of not being 'fully' overwritten, the original sector can be read. This trail is sometimes referred to as a 'ghost track'.
  • If the data is overwritten with random bytes, the same logic cannot be used to reconstruct the sector. While a few traces of the original data might be left, they are much more difficult to analyze. However, since shred cannot guarantee that expensive hardware probing will not be able to identify a 'ghost track', it overwrites 3 times by default for increased security.
Aside #2: Where does shred get the random data it overwrites with from?
Most applications get random data from the system-wide entropy pool, /dev/random. In the Linux kernel this is a 'device file' that gathers entropy from keyboard, mouse, network, and disk activity and mixes it with a cryptographic hash (SHA-1). Since its purpose is to provide cryptographically secure random numbers, /dev/random draws from a limited pool: if too many random bytes are requested, the pool is exhausted and reads block until more entropy can be gathered. This makes it inefficient for use with shred. Instead, we use /dev/urandom, a pseudo-random number generator (PRNG) that is seeded from /dev/random and never blocks. One suggested way to get additional randomness for such a generator is the RDRAND instruction:

  • RDRAND is an instruction available on the Intel x86 and x86-64 architectures that acquires randomness from thermal noise on the chip.
  • The thermal noise is translated into random bits that are then passed through an AES-based conditioner to produce secure random numbers for applications. RDRAND is not the primary way by which /dev/random acquires its entropy, and that may have been a smart design choice: many critics suspect that the NSA has inserted a backdoor into it.
  • Additional randomness can be attained by XORing /dev/random output with RDRAND output.
Using a PRNG can greatly speed up shred, at the cost of some security.
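
As a rough illustration, a single overwrite pass using /dev/urandom might look like the sketch below. This is not how GNU shred is actually implemented, and short reads are treated as errors for simplicity.

/* Sketch of one overwrite pass using /dev/urandom -- an illustration only. */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

int overwrite_once(const char *path)
{
    int rnd = open("/dev/urandom", O_RDONLY);
    int fd  = open(path, O_WRONLY);
    if (rnd < 0 || fd < 0)
        return -1;

    off_t size = lseek(fd, 0, SEEK_END);    /* how many bytes to overwrite */
    lseek(fd, 0, SEEK_SET);

    char buf[8192];
    for (off_t done = 0; done < size; ) {
        ssize_t want = sizeof(buf);
        if (size - done < want)
            want = size - done;
        if (read(rnd, buf, want) != want || write(fd, buf, want) != want)
            return -1;                      /* treat any short read/write as failure */
        done += want;
    }
    fsync(fd);                              /* push the random bytes to the device */
    close(fd);
    close(rnd);
    return 0;
}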

Aside #3: What are United States policies for secure file deletion?
We refer to the US Federal Information Processing Standards (FIPS) on file deletion:
  • Security Level 1 (Lowest): Delete files by overwriting with random data (default 3 times).
  • Security Level 2: Delete files by 'degaussing' the drive, basically bathing it in a strong magnetic field.
  • Security Level 3: Delete files by physically shredding the hard drive.
  • Security Level 4 (Highest): Delete files by melting the hard drive.

Log-Based File Systems and Bad Blocks - Why Shred Isn't Enough To Delete Data

In a log-based file system, data is not modified immediately; rather, a 'log' entry is posted and the update is applied later as necessary. A separate section of the partition holds a record of log entries containing the differences between the old and new versions of the file. Now shred no longer works! While it will overwrite the contents of the original file, the log containing the file differences can still be read. So we propose to do this:

$su
#shred /dev/rst/0      // Shred the entire disk!
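
As an aside, here is a hypothetical sketch of the kind of record such a log might hold. Real journal formats vary (some log only metadata or only new contents), but it shows why overwriting just the file's own blocks was not enough: copies of the data live outside those blocks.

/* Hypothetical on-disk journal record -- the layout is made up for illustration. */
#include <stdint.h>

#define BLOCK_SIZE 8192

struct log_record {
    uint32_t block_no;                  /* which data block was updated               */
    uint8_t  old_contents[BLOCK_SIZE];  /* contents before the write (still readable) */
    uint8_t  new_contents[BLOCK_SIZE];  /* contents after the write                   */
};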


Now that we've shredded everything, are we safe? Nope. The reason is 'bad blocks' in the partition. Occasionally, certain sectors of a disk go bad: permanent damage to the platter, or failed transistors in flash memory, renders these sectors unusable. To counteract this, whenever a bad block is detected it is recorded in the 'bad block map'. This map redirects the file system to a different block whenever it tries to access a bad block; extra space is set aside on the disk ahead of time to allow recovery from bad blocks. The problem: these bad blocks are often stuck in their magnetic or electronic state. Shredding is an overwrite operation, and thus has no effect on a sector that is already dead. All the data contained within these bad blocks can still be read after a shred.
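
Here is an illustrative sketch of how such remapping might work; the structures and names are made up, not a real driver's. The key point is that once a sector is remapped, every later write, including shred's, goes to the spare sector, and the original failing sector is never overwritten.

/* Illustrative bad-block remapping -- not real driver code. */
#include <stdint.h>

struct remap_entry {
    uint32_t bad_sector;     /* sector that went bad            */
    uint32_t spare_sector;   /* replacement from the spare area */
};

struct bad_block_map {
    struct remap_entry *entries;
    uint32_t            count;
};

static uint32_t resolve_sector(const struct bad_block_map *map, uint32_t sector)
{
    for (uint32_t i = 0; i < map->count; i++)
        if (map->entries[i].bad_sector == sector)
            return map->entries[i].spare_sector;   /* redirected to the spare */
    return sector;                                 /* not remapped            */
}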

File System Levels - Overview

The main file system abstractions, from the lowest level up, are:

  1. Sectors - 512 bytes
    • Subdivision of a track on a magnetic disk
    • Writes to disk must be at least 1 sector long
  2. Blocks - 8192 bytes (16 sectors)
    • Larger block sizes are more efficient for sequential access (fewer, larger I/O requests)
    • Smaller block sizes waste less space when files are small
    • Smaller block sizes also reduce internal fragmentation, the space wasted in a file's last, partially filled block (a small worked example follows this list)
  3. Partitions
    • Disks might contain a boot, swap, /usr and unused partition, for instance.
    • There is no requirement that partitions share the same file system type.
    ----------------- Everything above is a generic building block that a file system sits on top of; the components below belong to the file system itself -----------------
  4. Inodes
  5. File Name Components
    • All parts of a file name except the slashes.
    • e.g. the file name components of /usr/bin/grep are "usr", "bin", and "grep"
  6. File Names (Path Names)
    • The full path name of a file.
    • e.g. the file name of /usr/bin/grep is "/usr/bin/grep"
    • Do not confuse this with the PATH environment variable.
  7. Symbolic Links
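
As a small worked example of the block-size trade-off mentioned in the list above (the function and numbers are purely illustrative), internal fragmentation is the unused tail of a file's last block:

/* Toy calculation of internal fragmentation -- purely illustrative. */
#include <stdio.h>

static long wasted_bytes(long file_size, long block_size)
{
    long blocks = (file_size + block_size - 1) / block_size;   /* round up */
    return blocks * block_size - file_size;
}

int main(void)
{
    /* A 1000-byte file wastes 7192 bytes with 8192-byte blocks,
     * but only 24 bytes with 1024-byte (2-sector) blocks. */
    printf("%ld\n", wasted_bytes(1000, 8192));   /* prints 7192 */
    printf("%ld\n", wasted_bytes(1000, 1024));   /* prints 24   */
    return 0;
}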

Determining Free Space In File Systems

In FAT file systems, free space is determined using the file allocation table, which marks all free clusters with a -1. In Version 7 UNIX, no such table existed: to determine how much free space there is, you would have to find out how many blocks are in use and subtract that from the total block count, an incredibly expensive operation. The solution to this problem was the BSD Fast File System, created at UC Berkeley.

Berkeley Fast File System Diagram

In the BSD file system, each cylinder group (a piece of the larger file system) contains a copy of the superblock, 'cg info' holding file system descriptors for that group, an inode bitmap, a free block bitmap, the inodes themselves, and the data blocks. Specifically, the free block bitmap contains one bit per data block, with 0 indicating that the block is free and 1 indicating that the block is in use. The free block bitmap and the inode bitmap give us a relatively fast way to determine which blocks and inodes are free. Otherwise, we would have to walk through every inode and reconstruct a bitmap each time we wanted to allocate a file.
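
With the bitmap in hand, allocating a block can be a simple scan for a clear bit. A minimal sketch, assuming the convention above (1 = in use, 0 = free); the names are made up and this is not actual BSD code.

/* Minimal free-block-bitmap allocation sketch -- not actual BSD FFS code. */
#include <stdint.h>

static int32_t allocate_block(uint8_t *bitmap, uint32_t nblocks)
{
    for (uint32_t b = 0; b < nblocks; b++) {
        if (!(bitmap[b / 8] & (1u << (b % 8)))) {   /* bit clear => block is free */
            bitmap[b / 8] |= 1u << (b % 8);         /* mark it in use             */
            return (int32_t)b;
        }
    }
    return -1;   /* no free blocks */
}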

Pros and Cons

+ Only 1/2^16 of the disk (one bit of bitmap per 8192-byte = 2^16-bit block) is used up by adding the free block bitmap.

+ It is relatively fast to determine free blocks and allocate a file appropriately.

- Extra I/O is required (to read the bitmap) to know where to allocate a file.

- If the free block bitmap is instead cached in memory to avoid that I/O, the in-memory copy can get out of sync with the copy on disk (for example, after a crash), so it has to be written back carefully or rebuilt.

However, building a file system this way allows us to more quickly determine how many blocks are free to write into contiguously. The primary goal is to minimize seek time for the disk head. On a typical disk the head takes about 10 ms to seek to the right track and then waits roughly 8 ms of rotational latency for the data to come around under it. If a file is written contiguously, I/O is a lot faster because far less seeking and rotation is required to read sequential data. (Assuming, of course, that no other activities, like other I/O operations, are requested of the file system.)
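
Using those rough figures (and ignoring transfer time), a back-of-the-envelope comparison of reading 100 scattered blocks versus 100 contiguous blocks might look like this; the numbers are illustrative only:

/* Back-of-the-envelope disk timing comparison -- illustrative numbers only. */
#include <stdio.h>

int main(void)
{
    double seek_ms = 10.0, rotation_ms = 8.0;
    int nblocks = 100;

    /* Scattered: pay seek + rotational latency for every block. */
    double scattered_ms = nblocks * (seek_ms + rotation_ms);    /* 1800 ms */

    /* Contiguous: pay the positioning cost once, then stream.   */
    double contiguous_ms = seek_ms + rotation_ms;               /* ~18 ms  */

    printf("scattered: %.0f ms, contiguous: ~%.0f ms\n",
           scattered_ms, contiguous_ms);
    return 0;
}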

Inode Structure

What's in an Inode? What isn't in an Inode?

In the inode:
  • size
  • type (regular file, directory, symbolic link)
  • # of links
  • date (last modification)
  • permission modes
  • addresses of the blocks containing the file's data

Not in the inode:
  • file name
    • if the file is hard linked, it has no single name
  • parent directory
    • if the file is hard linked, it has no single parent directory
  • which processes have the file open
    • tracking this would add a large performance overhead
    • the kernel would have to keep the information in main memory, which is lost on reboot
    • there would be a coherency issue unless ALL inodes were kept updated
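
A rough sketch of what such an inode might look like on disk, loosely in the style of a classic Unix/ext2 inode; the field names and sizes are illustrative, not any particular file system's exact layout. Note that there is no file name and no parent directory in it.

/* Illustrative on-disk inode -- field names and sizes are made up. */
#include <stdint.h>

struct inode {
    uint16_t mode;            /* file type + permission bits            */
    uint16_t link_count;      /* number of hard links to this file      */
    uint32_t size;            /* file size in bytes                     */
    uint32_t mtime;           /* last modification time                 */
    uint32_t uid, gid;        /* owner and group                        */
    uint32_t direct[10];      /* block numbers of the first data blocks */
    uint32_t indirect;        /* block holding more block numbers       */
    uint32_t doubly_indirect; /* block holding blocks of block numbers  */
};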

Hard links can seem to be more trouble than they're worth sometimes. As noted above, a hard-linked file has no single name and no single parent directory, and the link count must be kept correct at all times. Worse, if hard links to directories were allowed, they could create loops in the directory graph, and programs that walk the tree (find, du, rm -r) could run forever; this is why hard links to directories are forbidden.

Hard Link Looping Scenario

So why keep hard links around? They are cheap (just another directory entry pointing at the same inode), they let several names refer to one file without copying any data, and renaming a file can be implemented as a link to the new name followed by an unlink of the old one, as sketched below.
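
Here is a minimal sketch of that rename idiom using the POSIX link() and unlink() calls; error handling is kept minimal, and the real rename() system call is atomic and handles more cases.

/* Renaming via link + unlink: the classic use of hard links (sketch only). */
#include <stdio.h>
#include <unistd.h>

int rename_via_link(const char *oldpath, const char *newpath)
{
    if (link(oldpath, newpath) != 0) {   /* create the new name, same inode */
        perror("link");
        return -1;
    }
    if (unlink(oldpath) != 0) {          /* drop the old name               */
        perror("unlink");
        return -1;
    }
    return 0;
}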

EXT2 Directory Entry Structure

Directory Entry Structure (in ext2)

An ext2 directory entry contains the following fields, in order:

  • inode number (4 bytes)
  • directory entry length (2 bytes) - the total space used by this entry, including any junk space after the name
  • name length (1 byte)
  • file type (1 byte)
  • name (up to 255 bytes)

In the example diagram, note that some entries have extra zero bytes after the name. These form the aforementioned junk space. The reason junk space exists is to make entry allocation and deallocation easy. Suppose we want to remove the fourth directory entry, test.txt. To deallocate it, we look at the previous directory entry, music/, and increase its directory entry length by 16 (the size of the test.txt entry). The test.txt entry is thereby absorbed into the junk space of the music/ entry. If instead we wished to add a directory entry, we would find an entry with enough junk space and shrink its recorded length to carve out the space we need.
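
A simplified sketch of that layout and of the deallocation trick; the real ext2 on-disk format and kernel code are more involved than this.

/* Simplified ext2-style directory entry and the deallocation trick. */
#include <stdint.h>

struct dirent2 {
    uint32_t inode;      /* inode number this entry points to           */
    uint16_t rec_len;    /* length of this entry, including junk space  */
    uint8_t  name_len;   /* length of the name actually in use          */
    uint8_t  file_type;  /* regular file, directory, symbolic link, ... */
    char     name[];     /* the name itself                             */
};

/* Remove the entry that immediately follows 'prev' by folding its space
 * into prev's junk space: one 16-bit write, and no data is moved. */
static void remove_next_entry(struct dirent2 *prev)
{
    struct dirent2 *victim = (struct dirent2 *)((char *)prev + prev->rec_len);
    prev->rec_len += victim->rec_len;
}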

Note that the directory entry and the inode both record the file's type, which creates an inherent coherency problem: what if the file type recorded in the inode and the file type recorded in the directory entry disagree? Linux sidesteps the issue by making an inode's file type immutable: you cannot change an inode's file type, so at the very least the file cannot be used by the file system incorrectly.