File sytem Robustness

(scribe notes by Weijia Yu and Ziheng Zhou)


 

How to find bugs in file system


We can use Invariants: Boolean expression which is over your variables and are always true.


There are some invariants in our file system:

1. Every block is used for exactly one purpose. Our file system has blocks such as boot block, super block, bitmap block, inode block and data block.

    For example, we cannot write inode in data blocks


2.All referenced blocks are initilized with data appropriate for their type

    For examlple, used block can be reached from the root, and unused blocks cannot be reached from the root, in this way, users will not see random data. References are not pointed to blocks which files do not allocate.


3. All referenced data blocks are marked as used in bitmap


4. All unreferenced blocks are marked free

 
Consequence of violating invariants!(DISASTER):

     
1. If someone wrote over the bitmap

    You could lose data because used block might be marked as free. If someone writes a new file, the file system may allocate new data to the used blocking, thinking it was free to use.

 

2. If inode points to garbage indirect block

    Files will be overwritten by two different programs because the filesystem will believe that the blocks are still valid(security hole!)

 

3. If used block marked as free

    If a new file is allocated, this block might be used, so the data which is originally in this block is overwritten.

 
4. Some unreferenced blocks are marked as used

    There can be a data leak


Bug priority:

some of above bugs are more important, some of they are less important (maybe only do with performance).

 

Cache coherence issues

system will play a lot of game with cache to make it fast

    For example, if we enter echo /usr/bin/sh, for each "/", it needs to access super block, first data block and inode( 3 disk access)


we have three "\", which means we will have over 9 disk accesses only for "echo /usr/bin/sh".

Q:How to solve it?


A:We keep "\" in RAM, because people always use it.



Dallying

when users call write(fd,buf,bufsize), system will put a copy of your buffer to cache and immediately return. However, it did not really write.

Problem: if system crashes at this time (run out of power, system on the flash drive unplugged), your data will lose


Solution:

1.Use battery back up

    a. One of the battery back up methods is called UPS(uninterruptible power supply)

        problem: too expensive 

    b. Our computer has internal battery backup for emergency

        problem: capacitor is not enough


2.Change our invariants

    a.we can use sysnc(): tell system "DONT DALLY" until entire cache has been written to disk

           disadvantage: 1.it costs too much time to transfer the entire data. For example, we need about 24 second to transfer 10G data.

                                  2.You cannot tell if the transfer process has been finished

            advantage: easy to implement

    b. we can use fsync(fd): write all of fd's cache data to disk and wait for it to be done

        return number: 0->work; -1->did not work.

        disadvantage: it takes more time, because write(fd,&c,1), it accesses at least 1 data block and 1 inode block

    c.we can use fdatasync(fd): it only gurantee to save only data of fd, not meta data, and it only costs 1 data block, which means it is twice fast.

        return number: 0->work; -1->did not work.



However, these are too complicated to users, we want it to be invisible.


When we call    rename("d/a","e/b"), from file's perspective, it did not change. It is the directory the one changed.

The system need to:

    1.read d's inode

    2.read d's data

    3.read e's inode

    4.read e's data

    5.write d's data(add "b")

    6.write e's data(without a)

    7.write d's inode

    8.write e's inode

Problem: suppose we lose power at "6.write d's data", it will cause data loss

                what if we swap 5 and 6? No, wrong link count. When you rebbot the system, the link count will be 1 instead of 0


Solution: we swap 5 and 6 and add "write files inode(+1 link count)" after 4.

               then we add "write files inode(-1link count) "before 10.

               it becomes:

 

    1.read d's inode

    2.read d's data

    3.read e's inode

    4.read e's data

    5.write files inode(+1 link count)

    6.write e's data(add "b")

    7.write d's data(without a)

    8.write files inode(-1link count)

    9.write d's inode

    10.write e's inode



Problem: it is kind of slow, and it explains why our Weensy system is slow


Solution: we can combine the idea with cache and dally. In order to gain performance, just to tell me when 1-4 is done using the most efficiently order. 5-8 must be processed in that order. 9-10 can be done using most efficiently order.

It can be represented as:


Another problem with crashes:

If we crash in the middle of a write to a disk or flash.

For example, suppose Emacs has to operate on flash drive and we reboot. How to tell which is new or old?


Solution: we can have a commit block telling which is new and which is old:



Another problem: what if it loses power when we write to commit block.

Solution: we can use 2 copies of commit block:

1
A
?
B
B
B
B
B
B
2
A
A
A
?
B
B
B
B


Problem: we cannot decide it's state if two are different.

Solution: we use three copies

1
A
?
B
B
B
B
B
2
A
A
A
?
B
B
B
3
A
A
A
A
A
?
B

RULE: we pick majority. If all three disagrees, we choose anyone except garbage.