Spring 2013, Lecture 5 Scribe Notes

Orthogonality

The choice of z is independent of the choices of x and y. That is, you can think about z without worrying about x and y. The effect of our choice on x does not propagate based on our choice of z. Use orthogonality to limit the propogation of effects (i.e., effects should not propagate from one axis to another).

Primitives: fork() and execvp()

We can apply the concept of orthogonality to operating systems. In the last lecture, we discussed fork(), which clones a process. The general definition of fork() maintained by IEEE is as follows:

System independent
fork() should work on all UNIX-like systems
If you run fork() and it succeeds, the child's properties are equivalent to the parent's properties, with the following exceptions:
- different pids (process id's)
- different ppids (parent process id's)
- accumulated execution times
- pending signals
  - the child process doesn't have any
- file locks
  - if a parent has locked the file, the child doesn't get the lock because only one file can have lock
- file descriptors
  - they share file descriptions, but have different file descriptors

What we are trying to do with fork() is minimize the number of differences between the parent and child processes. We ideally want fork() to clone the two processes as much as possible because the less they differ, the more fork() will be orthogonal to the other things we do (invoke operating systems, etc.). We want to maximize orthogonality. We want to have one primitive that creates processes and we want it to be as simple as possible.

Another primitive is execvp(). A sample usage is: execvp("/bin/cat", ...). Think of execvp() as the inverse of fork() (roughly speaking). When a process runs execvp(), it keeps all the properties mentioned above (i.e., pids, ppids, etc.), and replaces a lot of properties that fork() preserves, such as:

program that you run
contents of the stack and heap
instruction pointer (ip)
core of process changes

In effect, we are trying up to come up with two different axes ( execvp() and fork(). The idea is that we can pick two different points on each axis and it should work. For example, consider 0 fork()'s and 27 execvp()'s. In this case, we would reuse the same process over and over again, without creating any child processes.

A New Primitive: posix_spawnp()

One can imagine a different universe in which you have a single system call that performs the actions of both fork() and execvp(). Wouldn't that be simpler? Why not just have one system call? We can then ask: is this orthogonality worth it, even after the complexity that would be introduced? POSIX has a primitive that allows you to create a new process and tell the OS that you want it to run, called posix_spawnp():

int posix_spawnp (pid_t * restrict pid, char const * restrict file, posix_spawn_file_actions_t const * file_acts, posix_spawnattr_t const * attrp, char * const * restrict argv, char * const * restrict envp)

pid is the process id
file is the name of the file you want to execute
file_acts contains the actions you want the subsidiary process to
perform before running (e.g., ignore signals)
attrp is the spawn attribute object
argv contains the same arguments you would pass in to execvp()
envp holds the environment variables
the int return value indicates whether or not the function was successful

posix_spawnp() incorporates elements from both fork() and execvp()

Derived from fork()
- pid
Derived from execvp()
- file
- argv
- envp
Comes from neither fork() nor execvp()
- file_acts
- attb

An Aside on restrict

restrict is an annotation on the C level. When you use restrict, you are indicating the pointer being passed in points to a piece of storage that no one else is pointing to. It avoids aliasing, which is a situation in which a data location in memory can be accessed through different symbolic names in the program. The purpose of using the restrict keyword is for optimation, so that the value passed in can be cached in registers. If you violate the restriction, the resulting behavior is undefined.

posix_spawnp() gets implemented as a library function, which then calls fork(), execvp(), and other functions (in Linux). The problem with this primitive is that it's hard to remember/use and isn't complete enough. posix_spawnp() is an example of non-orthogonality. It is better to have simpler primitives, like fork() and execvp() because they only pick one access and ignore all the others.

I/O

So far, we've covered several processes:

access to the CPU
- fork()
- _exit()
- getpid()
how to replace the program running in the process with some other program
- execvp()

Now, let's look at I/O (access to devices). There are several qualitative differences between I/O devices and the CPU.

I/O devices are slow compared to the CPU
- Some CPU overhead is acceptable
Robustness is also an issue with I/O devices. We most likely don't want applications to access devices directly.
Devices differ and there is a wide variety of devices, but all CPUs are similar
- All over the map: slow, fast, direct access, etc.

Since there is more variety, I/O devices have high complexity, so there's more need for abstraction in the OS to simplify access to the devices. Ideally, we want a simple, clean abstraction in the OS, in which we can access any device (just as simple as fork(), getpid(), etc.). However, we can't use the same system calls for all I/O devices due to the wide variety.

The major dichotomy for devices:

network, keyboard, mouse (getting, receiving bytes)
- data stream, sequentialized (no random access)
- spontaneous data generation (devices have a mind of their own)
- infinite number of bytes
disk, flash (storage)
- random access
- request/response situation
- limited capacity

We want to have a single set of system calls that deals with these kinds of devices. We may have to compromise a bit on system calls due to this dichotomy. Let's review a few system calls we've already seen.

I/O Primitives

ssize_t read(int fd, void *a, size_t bn);
- fd is the file descriptor
- a is the address you're reading into
- bn is the size of the buffer
ssize_t write(int fd, void *a, size_t bn);
off_t lseek(int fd, off_t o, int flags);

However, these system calls are not as orthogonal as we'd like. lseek() only works for the random access devices, but read() and write() work for both classes of devices. IEEE/POSIX, in order to get more efficiency, have done to these orthogonal primitives the same thing that they did to fork() and exec().

ssize_t pread(int fd, off_t o, void *a, size_t len)

You can think of pread() is to lseek() + read(), as posix_spawnvp() is to fork() + exec(). You combine two things into one, and you do it for performance reasons. You want to have a single system call rather than two.

These are the basic system calls that deal with I/O. But, we are missing an important element. What's the creation and destruction operation for I/O devices? We need two more primitives: open() and close().

int open (char const *file, int flags, ...);

Values you can assign for flags (permission checking is done when you open)
- 0: O_RDONLY -- Open for reading only
- 1: O_WRONLY -- Open for writing only
- 2: O_RDRW -- Open for reading and writing
- O_EXEC -- Open for execute only (not available in Linux yet)
- O_SEARCH -- Open directory for search only (not available in Linux yet)

int close (int fd);

You may be wondering why close(); returns an int. If fd = -1, close(-1) returns -1, and sets errno to EBADF for Bad File Descriptor. close() can also set errno to EIO for I/O errors. The lesson here is: always check the return value of close()!

A scenario: Create a file called grades in the home directory on the SEASNET Linux server. Initally, the file can be read and written by anyone. We don't want this to happen, so we change the permissions so that only we can read from and write to the file. Now, we can add the grades. However, there's a security hole. An attacker can still go in and modify the grades even though we've set the permissions because he/she can open the file before we change the permissions.

 
	$ touch grades
	$ ls -l grades
	-rw-rw-rw- 0 grades 
	$ chmod 600 grades
	$ ls -l grades
	-rw------- 0 grades

The idea here is that when you call open(), we have have a process table. If you call getpid() and it returns 27, we are at process 27. Somewhere inside this process descriptor is an array of file descriptors. A common size limit for the array is 1024. Each element is a pointer that points to a file description, which then points to the actual file. The point is, each process keeps track of the files that it currently has open. As you call open(), you start filling up the file descriptor table. By convention, the first three elements in the array are special.

fd[0] = stdin
fd[1] = stdout
fd[2] = stderr

When you call open(), if it succeeds, it finds the first file descriptor in the file descriptor array that currently isn't being used and starts using it. open() then returns the file descriptor, otherwise -1 on failure and sets errno to EMFILE.

Now, we have a primitive to remove a file descriptor entry from the table using close(), and to create an entry using open(). Are there any other primitives?

Consider this possibility. Suppose you want to run a program where stdout and stderr are sent to the same location. In a shell, you can do cat foo >file 2>&1. This outputs the file and sets stderr to be exactly what's in stdout. How could you implement this in the shell? Here's one possibility:

	int fd = open("file", O_WRONLY);
	int fd2 = open("file", O_WRONLY);

There are some downsides to this approach. One problem is efficiency. open() tends to be relatively expensive because it has to resolve the file name and look up the file name; we want to avoid this. Instead, let's use another primitive called dup(). dup() is to file descriptors as fork() is to processes. It clones the file descriptor.

	int fd = open("file", O_WRONLY);
	int fd2 = open("file", O_WRONLY);
	int fd2 = dup(fd);

In effect, we have two different entries in our file descriptor table that point to the same file description. One benefit of using dup(), rather than two open()'s, is efficiency. Another is that you avoid overwriting the same file. If you open the file twice and start writing to each file description, they maintain independent pointers as to where they are in the file. Each will keep track independently of where it's writing to. If you use dup(), they're shared. If one pointer appends and the other pointer appends, they won't overwrite each other's work.

What open() will do is grab the next unused file descriptor (say, fd = 28), and then dup() will grab the next unused file descriptor (say, fd2 = 30). However, we wanted fd to point to the file and fd2 to be a clone of fd. In this case, open() and dup() aren't going be quite enough. Consider the following:

 
	 close(0);
	 close(1);
	 close(2);
	 open("/dev/null", O_RDONLY);
	 int fd = open("file", O_WRONLY);
	 int fd2 = dup(fd);

We know that the first three files are closed. Then, we open a file that we know exists, /dev/null. It succeeds and open() returns 0. Then, we can issue the previous calls. We know that fd must equal 1, and fd2 equals 2, because they new files occupy the now empty spaces after we closed file descriptors 1 and 2. This could work, but we are altering stdin. Also, this code is not very efficient because open() has to resolve file names. There's another primitive, which is like dup(), but avoids the aforementioned problems. It's called dup2().

dup2() makes a copy of fd and remembers to copy fd2. For example, if you invoke dup2(3, 5), (if 5 is open initially, the system closes it first), it makes 5 a clone of 3. Now, we don't have to write filler code.

	int fd = open("file", O_RDONLY);
	dup2(fd, 1);
	dup2(fd, 2);
	close(fd);

Still, there are several things can go wrong. Suppose fd = 1. Then we will have an issue when we try to close fd. So, we must make sure that at the very least, fd isn't 1 or 2.

	int fd = open("file", O_RDONLY);
	dup2(fd, 1);
	dup2(fd, 2);
	if (fd != 1 && fd != 2)
	    close(fd);

What can go wrong with file descriptors?

use a file descriptor that is closed, i.e. write to a file that is closed (it fails, and errno = EBADF)
use a file descriptor that was closed and then another file opened to the same file descriptor (writing to the wrong file)
I/O error (errno = EIO)
forget to close a file and the file descriptor table fills up, known as a file descriptor leak (table will eventually run out of room, or the OS doesn't write to disk properly and program is not reliable)
read from a stream device, but there is no data yet (read() hangs (default), or read() returns an error)
read from a file on a random access device of finite size and you reach EOF (read() returns 0, by convention)
reading from a file and someone removes the file

Let's consider the last issue: Suppose you're reading from a file and while you're reading, someone removes the file. In effect, we have Process 1: cat file. While in the middle of this process, we have Process 2: rm file. The file is only half-catted. In Linux, the convention is that if you have a file open, and someone else removes it, the file isn't actually removed. For files that are removed, but still open, you can still do I/O to it. So, in our case, file is unaffected by the Process 2's rm call.

This suggests an interesting trick. Let's say, you create a temporary file, and you're trying to sort an amount of data that's so great that it can't fit into disk. So, you're going to read some data, sort it, transfer it to a temporary file, and then merge all the temporary files at the end.

	char name[100];
	int fd = open(name, O_RDWR|O_CREAT|O_TRUNC, 0666);
	if (fd < 0)
	    error();
	// unlink is the system call to remove a file
	if (unlink(name) != 0)	
	    error();

Now, we can do read()'s and write()'s, to fd. The system will not actually clean up the file unil we close it. If someone opened the file between the time we created it and the time we unlinked it, there will be a problem because we'll have both file descriptors pointing to the same file. Here's a correction we can make.

	char name[100];
	long pid = getpid();
	sprintf(name, "/tmp/sort%ld", pid");
	int fd = open(name, O_RDWR|O_CREAT|O_TRUNC, 0666);
	if (fd < 0)
	    error();
	if (unlink(name) != 0)
	    error();

Suppose /tmp fills up and we call do $ df /tmp. It reports that there are 0 blocks free. The program starts crashing because it can't create any temp files. We can try using $ du /tmp, which finds all the large files in /tmp. However, since the files are nameless, this won't work. We need a better way of creating temp files.

	char name[100];
	long pid = getpid();
	sprintf(name, "/tmp/sort%ld", pd");
	int fd = open(name, O_RDWR|O_CREAT|O_TRUNC, 0666)
	if (fd < 0)
	    error();
	compute(fd);
	if (close(fd) != 0)
	    error();
	if (unlink(name) != 0)
	    error();

The downside to this approach is that if our program is not reliable and crashes in the middle of the computation, the process leaves behind junk files in /tmp. The solution to this problem is outside the scope of the OS.

There is another problem. We are assuming that everyone follows the same naming convention. Suppose we are executing our command on a system like SEASNET, which has hunderds of users and some of those users do not follow this convention. We want to reimplement this so that it will still function correctly in less reliable environments. We can check to see if the file already exists under that name, and if it does, use another name.

	struct stat st;
	char name[100];
	long pid = getpid();
	sprintf(name, "/tmp/sort%ld", pd")
	// stores the file name
	char buf[100];
	// keep generating random file names
	do {
	// outputs a file name beginning with /tmp and containing a lot of random file bytes 
		after that
	    generate_random_file_name(buf);
	// check if the file exists using stat
	} while (stat(file, &st) == 0);
	// at this point, we know file doesn't exist, so open
	int fd = open(buf, O_RDWR|O_CREAT|O_TRUNC, 0666);

However, this program is still buggy because we have a race condition (a bug that can happen due to multiple processes, or threads, running in parallel). The race condition occurs right after we exit the do-while loop and have verified that the file doesn't exist. At this point, some other process can create the file. Then, we're going to create the file, too, and overwrite the other process's data. Now, two processes are fighting over the same file. With the primitives we've covered so far, this race condition is unavoidable. The only way to get out of it is to introduce a new system call or locking scheme. Luckily, there is another option you can add, O_EXCL. This flag means create this file exclusively. If it already exists, don't create it, and return an error message.

	// We can remove stat because the flag O_EXCL will serve the same function
	struct stat st
	char name[100];
	long pid = getpid();
	sprintf(name, "/tmp/sort%ld", pd")
	char buf[100];
	do {
	    generate_random_file_name(buf);
	} while (stat(file, &st) == 0);
	} while ((fd = open(buf, O_RDWR|O_CREAT|O_TRUNC|O_EXCL, 0400)) < 0 && eerno == EEXIST);

There are many other race conditions that we haven't addressed. Imagine two processes writing to the same file at the same time, or a scenario in which one process writes to a file while another reads from the same file. We will deal with these race conditions in the next lecture.

CS 111, Spring 2013

Lecture 5: April 16, 2013

Notes prepared by Grace Lin and Queenie Ma