Scribe Notes

Operating System Principles
CS 111 - Spring 2013
Instructor: Prof. Paul Eggert

Scribe Notes - Lecture 13 (5/16/2013)
By: Deepak Ananth Rama

Topics:

Example - Virus checker update
Representation of Symbolic Links on a File System
Possible issues with Symbolic Links
File System Goals
Loss of Power
Lampson Sturgis Assumptions
Implementation of Rename in Berkeley File System
File System Invariants

Example - Virus checker update
Consider the case of updating a virus checker software on an file system. Assume that the current version of the software is placed in the file system as below.

What we want during the update:
1. An all or nothing update (no inconsistencies in the virus checker tree)
2. No issues on power loss
Suppose the following simple steps are performed:
        $ wget http://new-version.tar.gz               //1
        $ mkdir /new                                   //2
        $ (cd /new; tar xf) < new-version.tar.gz       //3
        $ mv /virus-checker /old                       //4
        $ mv /new /virus-checker                       //5
        The problem with the above steps is that between steps (4) and (5) no virus checker software exists on the system, leaving it open to attacks. In addition, a power failure after (4) may lead to inconsistencies. Therefore, we would ideally want to perform steps (4) and (5) together as an atomic operation, but no such enabling construct exists currently. A workaround would be to use symbolic links.

        Symbolic links are directory entries whose value is a filename. In other words a symbolic link has a filename as its contents they are files with filename for contents, i.e. their content refers to another file. The destination file may or may not exist, thus symbolic links may often be used to store string values.
        $ ln -s foo bar
        $ ls -l bar
          lrwxrwxrwx ...... bar->foo
        Here foo is the destination, and bar is the symbolic link. The file type l indicates that it is a symbolic file. System call symlink() is used to create a symbolic link.
        symlink(“foo”, “bar”)
        With an assumption that creation of symbolic link is an atomic operation, it can be used to do the virus checker update. Suppose v1 is the old version of the software, and virus-checker is created as a symbolic link to v1. It would have been simple to change the value of virus-checker to point to the new version. However, symbolic links are read only. Creation of symbolic links is atomic, but deleting and re-creating it is not.
        $ ln -s v1 virus-checker
        $ ls -l virus-checker
          lrwxrwxrwx ...... virus-checker -> v1
        A workaround is to create a new symbolic link, say new-virus-checker as a symbolic link to the new version v2, and then rename the new symbolic link new-virus-checker to the old link virus-checker. This works, as rename is always atomic. This ensures that the effective location of the current active virus-checker is maintained.
        $ ln -s v2 new-virus-checker
        $ mv new-virus-checker virus-checker
        $ ls virus-checker
          lrwxrwxrwx ...... virus-checker -> v2
Further, permissions don’t matter for symbolic link files, they are always accessible.
Representation of Symbolic Links on a File System
        How to implement symbolic links? Traditional answer is to use inodes, with the destination filename stored in a data block pointed to by the symbolic file’s inode’s first direct pointer as shown below.

Hard links can be created to symbolic links.

        $ ln -s a b
        $ ln b x
        $ mkdir d
        $ ln b d/y
        $ ls -l d/y
          lrwxrwxrwx ...... d/y -> a

        There are arguments that this wastes memory. A short circiuit answer is to use short symbolic links. Here the destination file name is stored in the inode itself, in place of the direct pointers. This limits the length of the destination file name, while preventing memory wastage.

        Symbolic links are always in the context of the directory. The below cat command fails as d/y points to a, which is expected to be in the same directory as y. d/y => d/a.
        $ cat d/y
          cat: d/y: No such file or directory
Possible issues with Symbolic Links
1. Hard links to symbolic links - No POSIX standard exists currently for its implementation
2. Dangling symbolic links point to junk or incorrect files
3. Symbolic links can be infinitely recursive. Consider the below example
  
          $ ln -s .. foo; find . -name foo
  
          Here a symbolic named foo is created for the parent directory of the current directory. This effectively creates a cycle in the virtual tree. Next a find is performed for the same filename foo in the current directory. The expected result is below
  
            foo                    //foo exists in current directory
            foo/self/foo           //foo points to parent directory
                                   //self is the current directory
            foo/self/foo/self/foo  //foo dereferenced to parent directory
  
           This infinite recursion does not occur - as find is programmed not to follow symbolic links. But the kernel uses simple substitution. Therefore, there are system calls that don’t follow symbolic links. For example stat() follows the symbolic link and lstats() does not. Also the flag O_NOFOLLOW can be used for the open() system call to not follow symbolic links. Another approach would be to track the directories traversed using an array.
4. In (c), the cycle was broken using new system calls. Another case is of two symbolic links pointing to each other.
  
          $ ln -s a b
          $ ln -s b a
  
          This too would be expected to go into an infinite loop if an open() system call is executed. This is prevented as the kernel limits a file name to have at most x symbolic links by depth. x is usually chosen to be 20. In such a case where links' depth go beyond this value, open() returns -1 as an error with the error code set to ELOOP. Note that using the concept of a trail of breadcrumbs will be fairly expensive to implement in this case.
5. Run out of disk space or inodes.
File System Goals
1. Performance: high throughput and low latency are desired
2. Changes (writes) must be deterministic: they must be either made on the disk, or not made (except in case of large file writes)
  
            char buf[1000000000000];
            fd = open(“file”, O_WRONLY);
            write(fd, buf, sizeof(buf));
  
  In large writes RAM based file systems do not give persistence of data in case of power failures.
3. Durability: file system expected to survive limited power failure, system crashes (or kernel panic) and disk (or SSD) failures. The first is more doable than making it tolerant to system crashes. Disk failures are considered in a later lecture.
Loss of Power
        Suppose you have opened a file in write mode to write the contents from a buffer into the file.
          fd = open(“file”, O_WRONLY);
          if(write(fd, buf, bufsize) != bufsize) error();
          close(fd);
        The above steps violate the golden rule of durability - “Never overwrite the only copy of your data”. The reasoning behind this is to prevent loss or corruption of data in case of a power failure. Since large writes are not atomic, a power failure midway through the write operation could possibly corrupt the existing data on the file, as well as cause loss of the data in the buffer.

        One possible solution is to use two files, say file1 and file2. During each write operation, overwrite file1 and file2 alternately. This ensures that at least the existing data in the file is recoverable. This approach should work in a simple environment, but there would be no easy way of identifying the old and new versions of the file. Let’s look at the various steps below, where A and B are the two copies.

        At any point of time we have either A or B on the disk. The problem here is that on a reboot case 2 and 3 are bad as we don’t know which of the two versions is correct. Timestamps would not help identify the correct version, as the junk data file might have a more recent timestamp.

        A more feasible solution involves having three copies of the open file. The steps are shown below. In any case, the copy with the majority is chosen as a winner and identified as the correct file to be considered. Since three copies are maintained, more storage is required and hence rarely implemented in operating systems.
Lampson Sturgis Assumptions
1. Storage writes may fail, or may corrupt other storage blocks
2. A later read of the storage can detect a bad block
3. Storage may spontaneously decay
4. Error are rare
5. Bad drives can be replaces in reasonable time
Implementation of Rename in Berkeley File System
Consider a rename() system call as below, it is not an atomic operation. The current process’s process descriptor is suggested as well.

rename(“a”, “b”);

The steps in the rename() system call are:
1. Read inode entry of the current working directory into memory
2. Read data block of the directory until we find entry for “a”
3. Change “a” to “b” in the memory
4. Write that data block onto the disk
5. Update timestamp in the inode of the directory
6. Write inode to disk
        Steps (d) to (f) are critical for successful rename() operation. Steps (e) and (f) if performed before (c) would result in wrong timestamp being recorded in the inode. Another issue wit with the length of the new filename. The directory entry would have been compacted to store the current filename a. If the new filename’s length is longer say bbbbbbbbbbbbbbb, the directory entry will have to be moved elsewhere. This results in an additional write() operation, update of pointer etc. More such writes only increase the risk of data corruption in case of a system crash or power failure midway.

        The rename() operation gets much more complicated if both the file and the containing directory are to be renamed.

           rename(“a/b”, “c/d”);

        There are two options to handle this, but both have issues if a system crash or power failure after the first step.
1. Update the directory name, and then the file name - A power failure after first step of directory rename leads to an effective file loss, as the file the renamed directory is not accessible with the new name. This is bad situation.
2. Update the file name, and then the directory name - A power failure after first step of file rename results in two hard links to the same file, i.e. the file is accessible in the same directory as either b or d. This is a much less serious case than the first and most operating systems implement this. It results in storage leaks, which can be fixed with a fsck() system call during reboot.
In the first case, midway we may have a situation as shown with both filenames pointing to the same inode. Though they would be similar to hard links, the link count in the inode is 1, as expected. On deleting the old filename the link count goes down to 0, resulting in the file data being deleted by the kernel, causing data loss. The fix for this is to perform the below steps. This however requires over four separate write operations and is thus expensive.
1. Increment link count
2. Update “c””
3. Update “a”
4. Decrement link count
fsck() can be performed without a reboot as well. The file system is first unmounted and then the fsck() system call is run. I also claimed that a fsck() can be performed without requiring the file system to be unmounted. We use the concept of invariants to handle issues such as loss of data and corruption of file system.
File System Invariants

File system Invariants are Boolean expressions that are always true. If a principle holds good before an operation such as rename, it should hold good even after the operation. These are a sort of design principles.

Invariants for Berkeley File system:
1. Every block is used for exactly one purpose (such as boot, super, bitmap, inode and data blocks)
2. All reference blocks are initialized to data appropriate for that type of data
3. All referenced data blocks are marked as used in the bitmap
4. All unreferenced data blocks are marked as free in the bitmap
Consequences of violation:
1. Corruption of data (disastrous)
2. Junk values in blocks (disastrous)
3. Correct data overwritten, and data corrupted (disastrous)
4. Memory leakage (problem, not disastrous)