Scribes: Alexander Lee and Ian Yarbrough
Lecture Date: 02/29/12
These next few concepts are overflowed from the previous lecture on File System Implementation
Issue with file system organization
Interactions between inodes have to go through multiple levels of indirection before getting to the actual data block. If we call an lseek(), we may actually be calling multiple lseeks each time we read the indirect indix from disk. This leads to unpredictable performance, a lot slower than originally percieved
To prevent this, many real time unix systems have a different kind of file called "contiguous files". Within each inode there is a bit that indicates whether the current inode
is contiguous or not. In contiguous inodes, instead of having direct and indirect blocks, there is a pointer to a contiguous region of data on disk. This is similar to the RT11
file system, but this file system has the capability to support both contiguous and ordinary UNIX files.Since it is simply just a big chunk of disk, there would be no need for
using multiple lseeks()
To open a contiguous file, you simply specify that is is contiguous by specifying the flag. In the example below we specify the size of the file to be 1 GB.
open("file", O_CREATE | O_RDWER | O_CONTIG, 0644, 1024 * 1024 * 1024);
What's the downside?
Two functionalities
mknod("/dev/null", magic numbers);
This system call, mknod, would be found in a boot script. The system call mknod() is able to create a "special" file with the
correct parameters. So now when using a ls-l /dev/null
it will return:
A named pipe is similar to a regular pipe, except that it the scope of the pipe is longer than one process, also needs to be deleted. How to create a named pipe:
$ mkfifo /tmp/mypipe
$ ls -l /tmp/mypipe
prw-rw-rw- ... /tmp/mypipe
The p indicates that the file is a named pipe, examples on how to use the pipe:
$ cat /tmp/mypipe > foo & //Here the cat is reading from the pipe
$ sed s/a/b/ /etc/passwd > /tmp/mypipe //Here sed is writing to pipe, cat will then read
When the above commands finish, foo will contain a copy of the password with b substituted for a.
The named pipe has an inode because it is looked up by a directory entry just like a regular file, but there is no data stored on disk. The pipe is in RAM.
We have essentially "hijacked" the filesystem code to be used for a different purpose
Suppose we have two file systems on our machine.
/usr is a directory within the root directory which points to a completely different file system. One way to not confuse the two systems would be giving each file system a specified number
In UNIX this number is of type dev_t. In the inord, there is both an inode number (ino_t) and also a device number (dev_t) which tells which file system the inode is associated with.
Almost no real file systems use this method! Why? Suppose:
$ln /usr/bin/sh /bin/sh
There are two issues:
A better way to get different file systems would be to have a mount table in the kernel
dev_t ino_t dev_t 1 2090 29
This maps the inode 2090, containing device 1, to the root directory of the file system with device number 29; this helps keep track of active mounted file system. This solves the link count problem,
since we don't need to be able the change the read only file systems link count, we just adjust the mount table located in main memory.
These are competing goals, for example:
We have 10 blocks of data on disk. We have 10 blocks of new data sitting on RAM. To do a save, we could just write the 10 blocks in RAM onto disk one at a time.
This operation may or may not be atomic. If only one block was changed, and block writes are atomic then the save operation is atomic, otherwise it is not.
One way to ensure atomicity is by writing the memory buffer to a different temporary region in desk.
| old data | metadata
| new data | metadata
The new data may be partially written if the computer crashed in the middle of the operation
Golden Rule of Atomicity never overwrite your only copy on disk!
We could imagine a file system in which half of the disk is reserved for storing temporary results.
Problem:
How do we tell which data on disk to use? We need a "switch" function that tells which data to use. There will be meta information associated with each file to tell which of the two data to use. But how do we update the meta data atomically?
Now we must make assumptions for the underlying support for atomic update of metadata, we will make each assumption as simple as possible
old state ----> ???? ----> new state
t -------------------->
1 | old | ? | new | new | new | new | new |
2 | old | old | old | ? | new | new | new |
3 | old | old | old | old | old | ? | new |
After a crash, scan all 3 blocks and choose the majority file to determine a solid state that isn't metadata. If they all disagree, as in the fourth case, use the first block.
Not many file systems actually use this, because of performance factors (slow). The assumptions we made earlier are very conservative in nature, normally we use the assumptions below.
write(fd, buf, 8192); //This write also has to be aligned, must start at beginning of block
rename("a", "b");
When renaming a file you need to write to the disk where the dentry is contained. You also need to write to the disk where the inode is currently contained to update the time stamp. Since we are
dealing with two separate blocks the operation is not atomic
Now what would happen is we tried this:
rename("d/a", "e/b");
One way to do this is to put a new directory "b" in "e", then delete "d/a". This way you won't lose the file if it crashes in the middle, but there is still a problem.
You could end up with both "d/a" and "e/b", but only a link count of 1. If you remove one of the two directory entries, the link count drops to 0, the storage is reclaimed,
and the other directory entry is now invalid.
d's inode data block e's inode data block file inode ____________ _______________ ____________ _______________ ____________ Before |__|2010|____| |__| a | 23 |___| |__|2009|____| |_______________| |___|1|______| ____________ _______________ ____________ _______________ ____________ After |__|2012|____| |_______________| |__|2012|____| |__| b | 23 |___| |___|1|______| In this figure it shows the before and after states of the inodes and datablocks when a rename occurs. D and E's inode both have the timestamp of the a and b, and these inodes point to the respective datablock that has the information of a and b. Within these datablocks lie the actual filenames and the address of the file it points to. The actual file inode shows the link count of the file, in between the states the link count can drop to 0, which is the problem described earlier.
One approach would be:
One issue with the above approach is that if we crash inbetween 1 and 2, or 3 and 4, the link count would be too high, which then leads to a possible storage leak.
This is still the better than the first option. Systems routinely run clean up processes, fsck, which finds leaks and reclaims memory, resolving the link count issue.
It is the better option than simply crashing