CS 111 Lecture 12
Scribe Notes for 5/12/2014
by Andrew Foglia and Anthony Rivera
Today's Lecture:
- File System Implementation
- Levels of a File System
- inodes
- Hard Links
File System Implementation
Let's revisit the paranoid professor. He wants to reliably remove data from disk but is afraid that, depending on the file system, people may still be able to read it. The professor can try multiple approaches ("file" is the file we are attempting to remove):
$ rm file
- A naive attempt to simply remove the file
- Nobody can see the file anymore, but that does not mean this method works!
- What if there is an open file descriptor? If so, another process can still read the file's contents, as the sketch below demonstrates
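Here is a minimal sketch in C of that failure mode, assuming a pre-existing file named "file" (the name is illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("file", O_RDONLY);        /* open before the removal */
    if (fd < 0) { perror("open"); return 1; }

    unlink("file");                         /* this is what "rm file" does */

    char buf[64];
    ssize_t n = read(fd, buf, sizeof buf);  /* still succeeds! */
    if (n >= 0)
        printf("read %zd bytes after unlink\n", n);
    close(fd);                              /* only now can the blocks be freed */
    return 0;
}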
$ rm file
$ reboot
- The professor may attempt to remove the file and then reboot; rebooting at least closes any lingering file descriptors
- This is a drastic solution, but the data is still not safely removed! It will still sit in the file system
- The professor also wants remove to be fast, and reboot does not offer that
- The effects of these commands depend on the professor's file system...
Where is this remove function implemented?
FAT file system
- Take the file's directory entry and throw it away
- Walk the file's linked list of FAT entries and store -1 in each one to free its blocks (see the sketch below)
- Note: The data itself has not been erased, because we want to maintain some speed
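A toy sketch of that freeing walk. This is not real FAT code: the table is an in-memory array, and the end-of-chain marker FAT_EOF is invented for illustration (following the lecture's convention that -1 marks a free entry):

#define FAT_FREE (-1)
#define FAT_EOF  (-2)            /* hypothetical end-of-chain marker */

void fat_free_chain(int fat[], int first_block) {
    int b = first_block;
    while (b != FAT_EOF) {
        int next = fat[b];       /* remember where the chain goes next */
        fat[b] = FAT_FREE;       /* mark the block free; its data is NOT erased */
        b = next;
    }
}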
UNIX file system
- Similar to FAT, except free blocks are kept on a free list
- When the file is removed its blocks are no longer in the file's list, but their contents still exist on disk
$ shred file
- Attempt to shred the file and delete its contents permanently
- The shred command overwrites the content of the file (3 times) with random data, so the old bytes are tough to retrieve
Why random data rather than zeroes?
- Depending on the media device, there could be traces of the old data. This makes the professor more paranoid!
- When data is written onto the hard drive, the disk head writes over what it thinks is the center of the track
- Special devices exist to read the "ghosts" of the old track
- If you know what the new data is (e.g. zeroes) then it makes it easier to focus on and find the old data
- If new track is random, it will be harder to see what the old data was
- Analogy: Think of graffiti. It is much more difficult to read graffiti that has been painted over with other graffiti than graffiti that has been painted over with white paint (traces may still show through)
Where does the random data come from?
/dev/random
- "Window into the kernel"
- Kernel maintains an entropy pool
- The entropy pool is a system-wide buffer of random bits
- The pool is filled from "random" external events (e.g. low-order bits of timestamps from typed characters, network packets, etc.) but is a limited resource (it can run out)
- Cons: The entropy pool empties quickly, which makes shred run slowly
/dev/urandom
- Similar to /dev/random, but its entropy pool never runs out!
- Uses a pseudo-random number generator, which creates "fake" random numbers
- Recent Intel chips have created a new instruction called RDRAND
- RDRAND fills a register with random bits obtained using very low-level physical sources (e.g. CPU thermal energy and "noise"), but the instruction can still fail!
- Shred needs lots of random data
- Cons: /dev/urandom is fast, but still too slow for shred
- Solution: Shred has its own high-performance pseudo-random number generator that it seeds with /dev/urandom! A sketch of this approach follows.
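The sketch below shows the shape of that design, not shred's actual source: read one seed from /dev/urandom, then let a fast private generator (here xorshift64, a simple stand-in for shred's real one) produce the overwrite data:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static uint64_t prng_state;              /* seeded once from /dev/urandom */

/* xorshift64: simple, fast; a real tool would reject a zero seed */
static uint64_t prng_next(void) {
    prng_state ^= prng_state << 13;
    prng_state ^= prng_state >> 7;
    prng_state ^= prng_state << 17;
    return prng_state;
}

int overwrite(const char *path, int passes) {
    int rfd = open("/dev/urandom", O_RDONLY);
    if (rfd < 0) return -1;
    if (read(rfd, &prng_state, sizeof prng_state) != sizeof prng_state) {
        close(rfd);                      /* one cheap read is all we need */
        return -1;
    }
    close(rfd);

    int fd = open(path, O_WRONLY);
    if (fd < 0) return -1;
    off_t size = lseek(fd, 0, SEEK_END);

    uint64_t buf[512];                   /* 4 KiB of pseudo-random data at a time */
    for (int p = 0; p < passes; p++) {
        lseek(fd, 0, SEEK_SET);
        for (off_t done = 0; done < size; ) {
            for (int i = 0; i < 512; i++)
                buf[i] = prng_next();    /* fill buffer from the private PRNG */
            size_t want = sizeof buf;
            if ((off_t) want > size - done)
                want = (size_t)(size - done);
            if (write(fd, buf, want) != (ssize_t) want) { close(fd); return -1; }
            done += want;
        }
        fsync(fd);                       /* push each pass out to the device */
    }
    close(fd);
    return 0;
}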
US Federal Information Processing Standards (FIPS) on Deleting Data
- Melt the drive (highest level of security)
- Physically shred the drive
- Degauss the drive
- Overwrite the drive with random data (3 times)
- Options 1-3 destroy the drive entirely, 4 does not
- NOTE: These standards are old. For today's technology, it is not clear if this is enough
- Assume we have written enough random data to overwrite the file - does shred work?
- - NO! Data can still be recovered in a log-based file system
- When you do a write, the actual data is not modified in place
- - The write is posted in a log somewhere, and the actual data is not updated until later
- When you do a read, first check the log to see if there are recent changes
- - If so, grab the data from the log
- With this system, shred's overwrites may land somewhere else in the log and NOT on the original data (see the sketch below)
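A toy model of why this happens; every structure here is invented for illustration:

#include <string.h>

#define LOG_CAP 128

struct log_rec { int block; char data[512]; };

static struct log_rec log_[LOG_CAP];     /* pending writes, oldest first */
static int log_len;                      /* no bounds checking, for brevity */
static char disk[1024][512];             /* the "real" data blocks */

void fs_write(int block, const char *data) {
    /* append to the log; the block on disk is untouched for now */
    log_[log_len].block = block;
    memcpy(log_[log_len].data, data, 512);
    log_len++;
}

const char *fs_read(int block) {
    /* the newest log entry wins; fall back to the data block */
    for (int i = log_len - 1; i >= 0; i--)
        if (log_[i].block == block)
            return log_[i].data;
    return disk[block];
}

A shred pass that goes through fs_write lands in the log; until the log is replayed, the original bytes in disk[block] survive.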
So how do we remove a file and satisfy our paranoid professor?
- You can still use shred, but instead of erasing one file at a time, erase the whole file system:
- $ su
- $ shred /dev/rst/0
- But there is a catch!
- - Disks are flaky
- - Blocks can go bad on occasion
- - The disk drive manufacturer reserves extra area to recover from a bad block (using a bad-block map)
- - For each bad block, the map tells the drive where the good replacement copy lives
- - At the low level, data can still be recovered by reading those remapped bad blocks
Summary
- You must be careful about the levels of a file system
- It is important to know how it works all the way down to the device level in order to guarantee performance, security and other properties
- You should know how your file system works at the machine level
Levels of a File System (UNIX)
This is the typical layout of a UNIX file system (from highest to lowest level):
Symbolic Links
File Names
File Name Components
inodes
---- FILE SYSTEMS BOUNDARY ----
Partitions
Blocks (8192 bytes)
Sectors (512 bytes)
What is all this stuff?
Sectors
- Traditionally 512 bytes long
- A unit bigger than one byte, for efficiency reasons
- If sectors were too big, bus time would be wasted pulling unneeded data off the disk
Blocks
- Most file systems pick a bigger block size than sector size
- Bigger block size gives extra throughput
- Typical value is 8 KiB (8192 bytes)
Partitions
- Take physical drives and divide them into pieces (e.g. an array of blocks)
- Each partition can be treated independently
Above the File System Boundary
- inodes
- - Representation of a file
- - Describes the file itself, no matter which name you used to reach it
- File Name Components
- - All parts of the file name except the '/'
- - e.g. usr, bin, and grep in /usr/bin/grep
- File Names
- - Also known as "pathnames"
- Symbolic Links
- - Operate above file names
- - More abstract: a symbolic link's contents are themselves a file name
inodes
inodes are an integral part of any file system. What is in an inode and what isn't?
Free Space in UNIX File System
- Look at all the inodes
- - If a block is not being pointed at by any inode then it is free
- - Computing {all blocks} - {blocks in use} can be expensive
- - Not practical
- Need an efficient way to detect free space and where it is
- - FAT: If an entry is -1 then it is free space
- - UNIX: How do we tell...
- A solution for UNIX: the BSD file system, which keeps a bitmap with one bit per block recording whether the block is free (see the sketch after this list)
- + Fast to allocate storage
- - I/O is needed to allocate or free anything
- - The cache in memory may not match what is on disk (constant checks are expensive)
- - Want to minimize seek time (average seek time is ~10 ms + rotational latency of ~8 ms)
- Note: When growing a file we want contiguous data which speeds sequential I/O if no other activity
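A minimal sketch of such a free-block bitmap, assuming one bit per block with 1 meaning free (the polarity, sizes, and names are choices for illustration, not the real BSD layout):

#include <stdint.h>

#define NBLOCKS 8192

static uint8_t bitmap[NBLOCKS / 8];      /* one bit per block */

static int  is_free(int b)   { return bitmap[b / 8] >> (b % 8) & 1; }
static void mark_free(int b) { bitmap[b / 8] |=   1 << (b % 8); }
static void mark_used(int b) { bitmap[b / 8] &= ~(1 << (b % 8)); }

/* First-fit allocation; a real BSD file system would search near the
   file's existing blocks to keep data contiguous and seeks short. */
int alloc_block(void) {
    for (int b = 0; b < NBLOCKS; b++)
        if (is_free(b)) { mark_used(b); return b; }
    return -1;                           /* no free space */
}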
What is NOT in an inode?
- The file name
- The parent directory
- Which processes have the file open
- Why leave the open-file information out of the inode?
- - Bad performance: the kernel would have to go to disk whenever a file is opened
- - The kernel keeps track of which processes have a file open in main memory instead
- - A reboot clears that in-memory state anyway (on-disk records would have to be cleaned up later)
- - Security issues
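To make the contrast concrete, here is a sketch of a classic UNIX-style inode. Field names and sizes are illustrative, not a real on-disk format; the point is what is absent (no file name, no parent directory, no open-process list):

#include <stdint.h>

struct inode {
    uint16_t mode;                 /* file type + permission bits */
    uint16_t uid, gid;             /* owner and group */
    uint16_t nlink;                /* hard-link count */
    uint32_t size;                 /* length in bytes */
    uint32_t atime, mtime, ctime;  /* access/modify/change timestamps */
    uint32_t direct[10];           /* block numbers of the first 10 data blocks */
    uint32_t indirect;             /* block holding more block numbers */
    uint32_t dbl_indirect;         /* block holding indirect block numbers */
};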
Hard Links
Many arguments are made against the use of hard links. Hard links complicate the user model. Without them, every file has exactly one parent directory, so computing the parent directory's name from a file is easy. Hard links could also create loops in the directory structure, which would break programs that walk the tree (except that this is not possible in UNIX).
Question: How do we avoid hard-link loops?
Answer: We prohibit hard links to directories. Since a hard link can only point to a non-directory, no cycle can be created.
Why bother with hard links?
- Renaming is difficult to implement safely, so we want to be able to skip it
- We want programs that can do useful work without doing any renaming
- UNIX supports hard links, but FAT does not
In the Original UNIX File System
Instead of:
rename("a", "b")
Use:
link("a", "b")
unlink("a")
To see an advantage of the second method, consider the following code:
link("d/a","e/b")
unlink("d/a")
- In line 1, e is modified but does NOT change d
- In line 2, d is modified
- Each system call modifies just one file
- The problem with the "rename" operation is that it modifies two different directories at once
- - This causes trouble because a single system call must write to two places on disk
- - The hardware won't do that atomically; it picks one write to do first and the other to do second
- - So rename is vulnerable to crashes in between the two writes
- The two system calls are less vulnerable, as long as writing a single sector is atomic (even across a crash)
- This is why we use hard links!
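Written out with error checks, the idiom looks like this (safe_rename is a hypothetical helper name, not a standard call):

#include <stdio.h>
#include <unistd.h>

int safe_rename(const char *oldpath, const char *newpath) {
    if (link(oldpath, newpath) != 0) {   /* step 1: the new name appears */
        perror("link");
        return -1;
    }
    if (unlink(oldpath) != 0) {          /* step 2: the old name disappears */
        perror("unlink");
        return -1;                       /* worst case: the file keeps both names */
    }
    return 0;
}

After a crash between the two calls, the file is merely reachable under two names, which is easy to clean up; a half-finished rename could lose the file entirely.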
Fun Stuff with Hard Links
$ git clone glibc glibc-copy
- Makes a copy of the entire history of every change made
- Fast! Possible because the copy is made with hard links
- NOTE: This only works because the linked files are treated as read-only; modifying them in place would change both copies
A Typical Directory Entry (ext2)
Say we want to remove a directory entry...
- This can be thought of as "unlinking" a file
- Think of a directory as an array of variable-length directory entries
- Each entry records its own length, so we know how big it is
- Unlink by going to the previous directory entry and bumping up its recorded length, so that it covers more "junk" than it used to
- Create an entry by scanning the directory for a stretch of junk big enough to hold it
- Creation and deletion are relatively fast, and each can be done by writing a single block to the disk! (The struct sketch below shows the fields involved.)
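For reference, this is roughly the shape of an ext2 directory entry (simplified from the kernel's ext2_dir_entry_2):

#include <stdint.h>

struct ext2_dirent {
    uint32_t inode;       /* inode number; 0 can mark an unused entry */
    uint16_t rec_len;     /* bytes to the next entry; may cover trailing junk */
    uint8_t  name_len;    /* length of the name actually in use */
    uint8_t  file_type;   /* directory, regular file, symlink, device, ... */
    char     name[];      /* the name itself, NOT null-terminated on disk */
};

rec_len is the key to the trick above: deleting an entry just grows the previous entry's rec_len to swallow it.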
Example Filetypes
- Directory
- Regular file
- Symbolic link
- Device
Looking at the directory entry will give you some information about the file; the inode also holds this information (and it is cached). Be careful about this cache coherence, which Linux addresses by decreeing that a file cannot change its type. A sketch of asking the inode for a file's type follows.
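A small sketch covering the filetypes listed above, using lstat so symbolic links are reported as links rather than followed:

#include <stdio.h>
#include <sys/stat.h>

void print_type(const char *path) {
    struct stat st;                      /* filled in from the file's inode */
    if (lstat(path, &st) != 0) { perror("lstat"); return; }
    if (S_ISDIR(st.st_mode))             printf("%s: directory\n", path);
    else if (S_ISREG(st.st_mode))        printf("%s: regular file\n", path);
    else if (S_ISLNK(st.st_mode))        printf("%s: symbolic link\n", path);
    else if (S_ISBLK(st.st_mode) || S_ISCHR(st.st_mode))
                                         printf("%s: device\n", path);
}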
The lecture was interrupted by a fire alarm, so these notes are shorter than usual.