Operating Systems Principles Lecture 16 Scribe Notes (Winter 2013)

by Vincent Tai, based on Paul Eggert's lecture of 2013 March 11

Contents

  1. RPC Performance Issues
    1. Speedups
  2. Network File Systems
    1. VFS Layer
    2. NFS Protocol
    3. File Handles
    4. Synchronization and Consistency
  3. RAID
    1. RAID 0
    2. RAID 1
    3. RAID 4
    4. RAID 5

RPC Performance Issues

Remote procedure calls take a long time due to network latency: it takes time for messages to travel between the client and the server.

Ways to speed this up

The simplest way is to have the client and server closer together.

Pipelining is another method: instead of waiting for a response before sending the next request, the client sends multiple requests at a time. Pipelining does not always work, however: there may be dependencies between requests. For example, GET, PUT, GET issued in that order may depend on one another, so they cannot be pipelined. Pipelining works best when requests are idempotent, i.e. doing a request multiple times has the same result as doing it once.
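As a rough sketch of the difference, here is what sequential versus pipelined request handling might look like in C, assuming a connected socket sock and hypothetical fixed-size request and response records (error handling and short reads are omitted):

#include <unistd.h>

enum { NREQ = 3, REQSZ = 64, RESPSZ = 64 };

/* Sequential: one full network round trip per request. */
void sequential(int sock, char req[NREQ][REQSZ], char resp[NREQ][RESPSZ])
{
  for (int i = 0; i < NREQ; i++) {
    write(sock, req[i], REQSZ);   /* send one request... */
    read(sock, resp[i], RESPSZ);  /* ...and wait for its response */
  }
}

/* Pipelined: all requests go out back to back, so total latency is
   roughly one round trip instead of NREQ round trips. */
void pipelined(int sock, char req[NREQ][REQSZ], char resp[NREQ][RESPSZ])
{
  for (int i = 0; i < NREQ; i++)
    write(sock, req[i], REQSZ);
  for (int i = 0; i < NREQ; i++)
    read(sock, resp[i], RESPSZ);
}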

Caching is another way to improve performance: after receiving a response, the client saves it and reuses the saved answer if the same request is made again. A problem with caching is that the cache may become stale. There are a number of solutions, such as attaching an expiration time to each cached entry, or revalidating entries with the server before reusing them.
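As a sketch of the expiration approach, a client-side cache might look like the following; cached_rpc, do_rpc, and the 30-second freshness window are all made up for illustration:

#include <stdio.h>
#include <string.h>
#include <time.h>

enum { NSLOT = 128, SZ = 64, TTL = 30 };  /* TTL: assumed freshness window */

struct entry { char req[SZ]; char resp[SZ]; time_t fetched; int valid; };
static struct entry cache[NSLOT];

/* Placeholder for the real network call. */
static void do_rpc(const char *req, char *resp)
{
  snprintf(resp, SZ, "response to %s", req);
}

void cached_rpc(const char *req, char *resp)
{
  unsigned h = 0;
  for (const char *p = req; *p; p++)
    h = h * 31 + (unsigned char) *p;
  struct entry *e = &cache[h % NSLOT];
  if (e->valid && strcmp(e->req, req) == 0
      && time(NULL) - e->fetched < TTL) {
    memcpy(resp, e->resp, SZ);   /* still fresh: reuse the saved answer */
    return;
  }
  do_rpc(req, resp);             /* stale or missing: ask the server */
  snprintf(e->req, SZ, "%s", req);
  memcpy(e->resp, resp, SZ);
  e->fetched = time(NULL);
  e->valid = 1;
}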

Network File Systems

$ cd /net/foo/bar/baz
$ cat file
(gets contents of /bar/baz/file on machine foo)

How does this work?

From the operating system's point of view, there is a mount table: every time an inode is looked up, the mount table is searched to see if the inode maps to another file system. The mount table contains, among other things, the inode numbers and file systems of the inodes being mapped from and to.
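A minimal sketch of this lookup, with an illustrative (not Linux's actual) table layout:

/* Each entry says: this inode, on this device, is covered by the root
   inode of another file system. */
struct mount_entry {
  int from_dev, from_ino;   /* the mount point being covered */
  int to_dev, to_ino;       /* root inode of the mounted file system */
};

#define NMOUNT 16
static struct mount_entry mount_table[NMOUNT];
static int nmounts;

/* After resolving a path component to (dev, ino), redirect through the
   mount table.  Note that a loop in the table would make this spin
   forever, which is why loops matter (see below). */
void remap(int *dev, int *ino)
{
  int changed = 1;
  while (changed) {
    changed = 0;
    for (int i = 0; i < nmounts; i++)
      if (mount_table[i].from_dev == *dev && mount_table[i].from_ino == *ino) {
        *dev = mount_table[i].to_dev;
        *ino = mount_table[i].to_ino;
        changed = 1;          /* a mount may itself be covered by another */
        break;
      }
  }
}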

Problems with mounting include the possibility of looping (e.g. A is mounted on B, which is mounted on A). Traditionally, this was not a problem because mounting the same file system twice was not allowed. More recently, however, "loopback file systems" (lofs) have made loops possible.

Another problem is security. If anyone could mount, then an attacker could, for example, mount another file system over /etc/, replacing the passwd file and possibly gaining root access. As a result, mount is a privileged system call, so only root can mount.

VFS Layer

In Linux, file system operations are implemented via the VFS (Virtual File System) layer. For each file system, the kernel keeps structures such as "struct file_operations" and "struct inode_operations" that contain pointers to functions implementing operations such as read, write, unlink, and rename. Whenever one of these system calls is made, the kernel executes the function supplied by the file system implementation. This is essentially object-oriented programming in C. (If the Linux kernel were written in C++, there would just be a class with virtual member functions instead of function pointers.)
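A sketch of the idea (field names simplified; the real struct file_operations has many more members):

#include <sys/types.h>

struct file;   /* stands in for the kernel's open-file object */

struct file_operations {
  ssize_t (*read)(struct file *, char *, size_t);
  ssize_t (*write)(struct file *, const char *, size_t);
};

struct file {
  const struct file_operations *f_op;  /* filled in when the file is opened */
};

/* The generic read path just dispatches through the pointer, so the
   same system call works whether f lives on ext4, NFS, or /proc. */
ssize_t vfs_read(struct file *f, char *buf, size_t n)
{
  return f->f_op->read(f, buf, n);
}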

NFS client code is no different: it is inside the kernel, as a file system implementation in the VFS layer.

NFS Protocol

What should the network protocol look like? Since VFS layer functions need to be implemented, the NFS protocol is an RPC protocol in which requests look suspiciously like Unix system calls.

Examples (argument lists simplified from the actual NFS protocol):

LOOKUP(dirfh, name): returns the file handle and attributes of the named file in directory dirfh
CREATE(dirfh, name, attr): creates a file, returning its file handle and attributes
REMOVE(dirfh, name): removes the file, returning a status
READ(fh, offset, size): returns data from the file fh
WRITE(fh, offset, size, data): writes data to the file fh, returning a status

File Handles

What are file handles (the "dirfh" in the above example)?

System calls use file descriptors: small per-process integers given by the kernel.

File handles are like file descriptors, but we want an NFS client to survive server crashes. Because of this, we can't rely on file descriptors: the server would lose them after a crash. We need a way to refer to a file that survives crashes.

A file handle is basically an inode number + a device number, which refers to a file on the server. The client doesn't know details of the file handle: it's just a "magic bit pattern". The server, however, sees the file handle and can interpret it as a file.
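A sketch of what the server might pack into a handle (the layout is illustrative; the client never looks inside):

#include <stdint.h>

struct file_handle {
  uint32_t dev;   /* which file system on the server */
  uint32_t ino;   /* inode number within that file system */
  uint32_t gen;   /* not in the description above, but real servers often
                     also include a generation number so a recycled inode
                     isn't mistaken for the old file */
};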

Problem: deleted files

In traditional Unix, files with link counts of zero are retained by the kernel until they are closed. This will not work on NFS, since NFS servers are "stateless", so they don't retain information about whether a file is open.

To substitute for this behavior, clients cheat: if the link count would drop to 0, but the client has the file open, the client renames the file to a name like ".nfsQR9632" that is unlikely to be used. Of course, if the client crashes, this file will not get deleted. In addition, doing "ls -al" will show these hidden files.
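A sketch of the client-side trick, with hypothetical helpers standing in for the real RPC calls:

#include <stdio.h>

static void rpc_rename(const char *from, const char *to) { (void) from; (void) to; }
static void rpc_remove(const char *path) { (void) path; }

/* If the file is still open locally, rename it to an unlikely hidden
   name instead of removing it; the real removal happens at last close. */
void nfs_unlink(const char *path, int local_open_count)
{
  if (local_open_count > 0) {
    static int seq;
    char hidden[32];
    snprintf(hidden, sizeof hidden, ".nfs%06d", ++seq);
    rpc_rename(path, hidden);
  } else
    rpc_remove(path);
}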

Synchronization and Consistency of NFS

Client 1:
t = gettod();
write(fd, &t, sizeof(t));

Client 2:
t = gettod();
read(fd, &t1, sizeof(t1));

The order in which the requests arrive at the server matters. However, requests can arrive in any order due to network effects: even requests from a single client might reach the server in a different order than the one in which they were sent.

As a result, NFS does not have read-to-write (or write-to-read) consistency.

NFS does guarantee close-to-open consistency, however. When the client calls the "close" system call, it waits for the server's responses: all that "close" does is make sure outstanding requests are done before it returns. Note that close(fd) can therefore return -1 with errno == EIO if a write fails. This does not happen in ordinary local file systems, where write failures are reported by the "write" call itself.
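This means a program using NFS should check close's return value, since a write error may surface only there. A minimal sketch:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

int careful_close(int fd)
{
  if (close(fd) != 0) {   /* on NFS, this waits for outstanding writes */
    if (errno == EIO)
      fprintf(stderr, "a delayed write failed on the server\n");
    return -1;
  }
  return 0;
}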

NFS does have a "synchronous writes" option, in which the client waits for the server's response to each write before issuing the next request. It is usually not used, however, because performance is poor. Possible inconsistency is the price NFS pays for performance.

RAID: Redundant Arrays of Independent Disks

Since disks can fail, some redundancy is needed to prevent data loss. There are several forms of RAID, and different forms can be combined. The basic forms covered in class follow. See Wikipedia for more information.

RAID 0

RAID 0 does not actually provide redundancy; it merely treats multiple disks as one large disk.

There are two forms of RAID 0: concatenation, in which the disks are joined end to end so that the virtual disk's blocks are disk 0's blocks followed by disk 1's, and so on; and striping, in which consecutive blocks are interleaved across the disks, improving sequential throughput because neighboring blocks can be accessed in parallel.

RAID 1

RAID 1 is the simplest form of redundancy: the data is simply mirrored, i.e. there are two exact copies of the same data. If a disk fails, the data can be recovered from the remaining disk. Advantages of RAID 1 include simplicity and ease of implementation, but a disadvantage is that it costs twice the disk space of the stored data.
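A sketch of the idea, using Unix pread/pwrite on two file descriptors standing in for the disks:

#include <sys/types.h>
#include <unistd.h>

/* Every write goes to both disks.  (A real array would mark a disk
   that fails as bad and continue in degraded mode.) */
int mirror_write(int d0, int d1, off_t off, const void *buf, size_t n)
{
  int ok0 = pwrite(d0, buf, n, off) == (ssize_t) n;
  int ok1 = pwrite(d1, buf, n, off) == (ssize_t) n;
  return ok0 && ok1 ? 0 : -1;
}

/* A read can be served by either disk, or by the survivor after a failure. */
ssize_t mirror_read(int d0, int d1, off_t off, void *buf, size_t n)
{
  ssize_t r = pread(d0, buf, n, off);
  return r >= 0 ? r : pread(d1, buf, n, off);
}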

RAID 4

In RAID 4, there is a parity disk. For example, there are five disks: four (A, B, C, D) store normal data, and one is a parity disk, containing the xor of the other disks (A^B^C^D). If a disk fails, its data can be recovered by xoring the data of the other disks.
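A sketch of the arithmetic: parity is the bytewise xor of the data blocks, and any one lost block is the xor of everything else:

#include <stddef.h>

enum { NDATA = 4, BLOCK = 512 };

/* P = A ^ B ^ C ^ D, byte by byte. */
void compute_parity(unsigned char data[NDATA][BLOCK], unsigned char parity[BLOCK])
{
  for (size_t i = 0; i < BLOCK; i++) {
    unsigned char x = 0;
    for (int d = 0; d < NDATA; d++)
      x ^= data[d][i];
    parity[i] = x;
  }
}

/* Rebuild the block of failed disk "bad" (0..NDATA-1) from the others. */
void recover(unsigned char data[NDATA][BLOCK], unsigned char parity[BLOCK],
             int bad, unsigned char out[BLOCK])
{
  for (size_t i = 0; i < BLOCK; i++) {
    unsigned char x = parity[i];
    for (int d = 0; d < NDATA; d++)
      if (d != bad)
        x ^= data[d][i];
    out[i] = x;
  }
}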

One disadvantage of RAID 4 is that the parity disk can become a bottleneck, since a write to any of the other disks also results in a write to the parity disk. In addition, performance suffers after a disk failure, since reading from the bad disk requires reading all of the remaining good disks and xoring their data. Such a system is "highly available" (the data is still accessible after a failure), but it is not "fault tolerant" (performance degrades).

RAID 4 is the form that SEASnet uses.

RAID 5

RAID 5 can be described as RAID 4 with the parity striped as well: instead of living on a dedicated disk, the parity blocks are distributed among all the disks, so no single disk absorbs every parity write.
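A sketch of one common rotation scheme (the exact layout varies by implementation):

/* Which disk holds the parity block of a given stripe? */
int parity_disk(int stripe, int ndisks)
{
  return stripe % ndisks;   /* stripe 0 -> disk 0, stripe 1 -> disk 1, ... */
}

/* The stripe's data blocks occupy the remaining disks in order. */
int data_disk(int stripe, int block, int ndisks)
{
  int p = stripe % ndisks;
  return block < p ? block : block + 1;
}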

Note that with RAIDs 1, 4, and 5, it is assumed that at most one disk will fail at a time. If two disks fail, data will be lost. In order to be resistant to more disks failing, more redundancy is needed.