File System

Jingzhi Yu | Qi Shao

Hybrid Data Structure
RAM + DISK

Performance
Persistence

Directory Entry

Directories are handled as normal files, but are marked in inode type as directory

Property

Multiple directory entries may reference the same Inode number (hard link)

Users identify files via pathnames a/b/c that are mapped to Inode numbers by the OS

If the path starts with /

A directory entry contains:

Size Content

32-Bit Inode Number

16-Bit Directory Entry Length

8-Bit Name Length

8-Bit File Type

N-Byte File Name

Size	Content
32-Bit	Inode Number
16-Bit	Directory Entry Length
8-Bit	Name Length
8-Bit	File Type
N-Byte	File Name

Inode

Each file is represented by an Inode

Property

Stored in Inode table in main memory, also on disk

Record metadate for file

Fix size per file

Identified by an integer number, not a name

It constains:

Content

owner

timestamp

permission

type(inmutable/fixed)

Note:
Inode:

Both directory entry and inode entry have file type
The type info in the directory entry is cached from the inode entry

1: find . -type f

2: find . -mtime +1

2 is slower than 1 because it has to go to the inode instead of the directory entry

Content
owner
timestamp
permission
type(inmutable/fixed)

Layers of abstraction

The following are layers of abstraction in a file system design(up to bottom):

Symbolic links, shortcuts in the file system graph
File and Directories, if is directory, it can point to other directories
Inodes
Blocks (usually a power of two multiple of sector length, say 8 KiB)
Sectors (fairly small, say 2048 bytes)

Remove a file

unlink("foo")

It is possible that you have multiple directories for the “foo” file(hard link). Even if one of the links has been removed, others still exist, which means link count is still non-zero.
Even if there is only one link, the run is still not complete. Considering the race condition, before you remove the file, somebody could create a copy of the file like below:

$ ls -l foo
$ ln foo bar   
$ rm foo

So, there are two main problems in terms of removing a file in the file system:

Others may link to the file or open the file already, so your deletion might not help at all.
Removing file doesn’t destroy the content on the disk; instead, it simply marks the data blocks as free.

So, what should we do? Of course, we don’t want to physically destroy the disk.

One method could be write over the file with random numbers and then remove it. Actually there is a command called shred for this purpose. If you want to make sure the file is destroyed, you can shred it multiple times. In most cases, this is enough.

Shred doesn’t work well in log-structured file system. So when dealing with this kind of file systems, remember to shred the file system itself!

$shred file
- open file
- set target size
- overwrite the file with random data
- close file 
- unlink file

How to interpret file names?

The internal mechanism of interpreting file names is called “namei”. Basically, it establishes a map from strings given by users to corresponding devices, inode pair.

For example open("a/b/c", ...), the OS will start at the working directory (for a relative path) or the root (for an absolute path). It does a left-to-right traversal over a file name, exploring directories and keeping track of the current directory’s inode. From the inode, we can get the directory entries to look up the next file name component and continue down the tree.

Directory Hierarchy

For example, every file has a inode number. After exploring a/b, the current directory inode will be b’s inode, whose number is 1017 and the file name component to resolve is c. The new device, inode pair will then attempt to be resolved.

If a file is missing during this traversal process (for example, if c is nowhere to be found once a is the current directory), then errno is set to ENOENT. If search permissions (the x bit) are missing on a directory to be explored, errno is set to EPERM.

Note:
Every process has a working directory, even with the same command $ cat a/b, the contents printed on the screen can be different as the working directory may vary.

Create a file

$ open ("a/b/c/d", O_WRONLY|O_CREAT|O_EXCL)

Note:
Last component in the specified path - in this case is “d” - does not necessarily exist. For example, link("a","b"), file “b” does not have to exist.

Link:
link count = count of # directory entries pointing at that inode

A link is just a pointer to an inode. A directory is an inode that holds links. Each filename in a directory is just a link to an inode.

Hard links to directories in Linux are not allowed. One potential problem it might cause is creating a cycle. For instance:

$ mkdir a b
$ touch a/f b/g
$ ln a b/l
$ ln b a/k

enter image description here

In this case, if we allow directory a to point to directory b and directory b to point to directory a, two different directories in different points in the filesystem are pointing to the same thing. In fact, a subdir could point back to its grandparent, creating a loop.

Why is this loop a concern? Because when you are traversing, there is no way to detect you are looping (without keeping track of inode numbers as you traverse). Thus, allowing hard links to directories would break the directed acyclic graph structure of the filesystem, creating directory loops and dangling directory subtrees.

Space allocation for files

Recall the structure of an inode entry. There are uid, timestamp, and other information of the file. Following the fixed info, there are direct block pointer, indirect block pointer and maybe indirect2 block pointer.
Indirect blocks in an inode are used when the file size exceeds the amount of data that can be stored in the very limited direct blocks. The indirect block stores addresses to blocks where more data is stored. An indirect block can store addresses of BLOCK_SIZE / ADDRESS_SIZE blocks. When the first indirect block has been used up, the file system will allocate the doubly indirect block, which stores addresses to more indirect blocks that store addresses to data blocks. See the diagram below.
enter image description here