CS 111 lecture 1 Winter 2004 midterm answers

1. DMA won't help all that much, since DMA is essentially a
throughput-enhancing technique and throughput isn't important here.  It
might reduce latency somewhat, though, since the CPU can respond to
another device more quickly if this device is using DMA.  Memory-mapped
versus traditional I/O doesn't matter: memory-mapped I/O is simply a
convenience that simplifies the instruction set and has little or no
effect on performance.  Polling is a win, since it will reduce response
time.  If we use polling, it won't matter how many CPU registers are
available; but if we use interrupt-driven I/O, the extra CPU registers
will hurt response time because of the overhead of saving and restoring
them.

2. There's no difference in behavior if we consider only the stream of
output bytes.  However, there are performance and buffering issues.
Clearly the pipeline will be a bit slower.  Suppose simple_sh reads one
byte at a time, and suppose standard input is a slow device.  The user
would observe that commands were executed in "bursts" rather than
immediately after each newline was input, because "cat" would read 512
bytes and then send them all at once to "simple_sh", instead of
"simple_sh" reading the input bytes one at a time.  (In practice this
is not too much of an issue with Linux, since terminal input is
line-buffered.)

3. Have two processes each read from a pipe, each expecting the other
process to write to that pipe.  Here's some code (with only rudimentary
error reporting):

   #include <stdlib.h>
   #include <unistd.h>

   int
   main (void)
   {
     int fd[2];
     char c;
     if (pipe (fd) != 0)
       _exit (EXIT_FAILURE);
     if (fork () < 0)
       _exit (EXIT_FAILURE);

     // Copy pipe input to pipe output.
     // This code is executed by both parent and child.
     // Neither process ever writes first, so both block in "read",
     // each waiting for the other: deadlock.
     for (;;)
       switch (read (fd[0], &c, 1))
         {
         case -1: _exit (EXIT_FAILURE);
         case 0: _exit (EXIT_SUCCESS);
         case 1:
           if (write (fd[1], &c, 1) != 1)
             _exit (EXIT_FAILURE);
         }
   }

One way out is to use non-blocking reads and to report an error if no
data are available (sketched below); this prevents the "no preemption"
condition.  Another is to modify each process to write before it reads;
this avoids the "circular waiting" condition.
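Here is a minimal sketch of the non-blocking variant (not part of the
original answer), assuming the usual fcntl/O_NONBLOCK interface;
instead of blocking forever on an empty pipe, each process reports an
error and exits:

   #include <fcntl.h>
   #include <stdlib.h>
   #include <unistd.h>

   int
   main (void)
   {
     int fd[2];
     char c;
     if (pipe (fd) != 0)
       _exit (EXIT_FAILURE);
     // Make reads from the pipe non-blocking.
     if (fcntl (fd[0], F_SETFL, O_NONBLOCK) != 0)
       _exit (EXIT_FAILURE);
     if (fork () < 0)
       _exit (EXIT_FAILURE);

     for (;;)
       switch (read (fd[0], &c, 1))
         {
         case -1:
           // With O_NONBLOCK an empty pipe makes "read" fail with
           // EAGAIN rather than block, so this error exit replaces
           // the deadlock.
           _exit (EXIT_FAILURE);
         case 0: _exit (EXIT_SUCCESS);
         case 1:
           if (write (fd[1], &c, 1) != 1)
             _exit (EXIT_FAILURE);
         }
   }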
4. Here are some possibilities:

a. The application executes the equivalent of the machine instructions
inside the implementation of the "unlink" system call (these typically
contain a "trap" instruction), without invoking "unlink" directly.  (A
sketch of this appears after the references at the end of these
answers.)

b. Some other function in the C library ("remove", say) calls "unlink",
and that call was already resolved to a machine address when the C
library itself was linked.

c. The application uses "fork" and "exec" to invoke some other program
that does the unlink.

d. The application used dup2 (or something similar) to redirect stderr
temporarily to some other file.

e. Some other thread, or a signal handler, could modify the file name
between the time it's printed and the time it's passed to "__unlink".

f. The file that stderr was redirected to was on a file system that was
temporarily full, so the writes to it failed.

5. If remote memory access differs only in performance (even when
atomic instructions are involved), then there should be few correctness
problems when running an SMP-oriented OS on Red Storm.  So a port to
Red Storm should be doable fairly quickly, though its initial
performance won't be the best.  It's not likely that a NUMA
architecture would do well with an arbitrary SMP-style application in
which a process contains many threads that access memory randomly.
Such an application won't map well to a NUMA machine: either the
threads will each get their own CPU, which means they'll access a lot
of nonlocal memory, or the threads will time-share a single CPU,
eliminating parallelism.  Note that big arrays that are accessed
linearly (e.g., in many simulations) need not be local, since AMD
claims that nonlocal throughput is about as good as local; only memory
that is accessed "randomly" would benefit from being local.

It might be tough to figure this all out automatically, so our best bet
may be to tell programmers that they'll get their best results if they
write their applications and threads so that they can be partitioned
easily into mostly-distinct memory areas, such that each thread
accesses (mostly) its own memory partition when it accesses memory at
random.  We may need to add a system call that asks the OS for
random-access heap memory intended to be "close" to the current thread;
the OS can remember these requests and schedule threads and memory
accordingly.  (A second sketch at the end of these answers illustrates
one possible shape for such an interface.)  The OS could balance usage
by migrating threads from busy CPUs to lightly loaded ones, and
similarly could migrate memory (using the virtual memory hardware to
remap addresses to the migrated memory).  It may also make sense to
keep multiple copies of some objects (e.g., the C library), one per
CPU, to improve performance.  All this will require extensive reworking
of the scheduler and memory manager.

PS.  The Red Storm architects looked at these issues and decided to use
GNU/Linux only in the roughly 300 service and I/O nodes, with a
special-purpose OS in the roughly 10,000 compute nodes.  For more about
the Red Storm architecture and why it does not use GNU/Linux as the
compute-node OS, please see:

   R. Brightwell et al., "A performance comparison of Linux and a
   lightweight kernel", Proc. IEEE International Conference on Cluster
   Computing (CLUSTER'03) (2003-12-01), p. 251.
   http://csdl.computer.org/comp/proceedings/cluster/2003/2066/00/20660251abs.htm
   http://www.csis.hku.hk/cluster2003/presentation/technical/4A-2.pdf
   http://ieeexplore.ieee.org/iel5/8878/28041/01253322.pdf (UCLA IP addresses)

$Id: midans.txt,v 1.2 2004/03/22 06:03:19 eggert Exp $
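The following sketch (not part of the original answers) illustrates
possibility 4a, assuming a GNU/Linux system that defines SYS_unlink:
the program removes a file by issuing the system-call trap through
syscall(), so the C library's "unlink" wrapper is never called by name.

   #include <stdio.h>
   #include <sys/syscall.h>
   #include <unistd.h>

   int
   main (int argc, char **argv)
   {
     if (argc != 2)
       return 1;
     // syscall issues the kernel trap itself, bypassing any wrapper,
     // breakpoint, or log message attached to the "unlink" function.
     if (syscall (SYS_unlink, argv[1]) != 0)
       {
         perror ("SYS_unlink");
         return 1;
       }
     return 0;
   }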
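And for answer 5, here is a rough sketch of how the proposed
"allocate memory close to me" interface might look to an application.
The name alloc_near_me is hypothetical, and it is stubbed out with
plain malloc so the example compiles; a real port would implement it as
the proposed system call.

   #include <stdlib.h>
   #include <string.h>

   // Hypothetical interface: ask the OS for heap memory that should be
   // placed "close" to the calling thread.  A real implementation
   // would also let the OS remember the request when scheduling
   // threads and memory; this stub just calls malloc.
   static void *
   alloc_near_me (size_t size)
   {
     return malloc (size);
   }

   int
   main (void)
   {
     // A big, linearly scanned array need not be local, since nonlocal
     // throughput is claimed to be about as good as local.
     size_t n = 1000 * 1000;
     double *big_array = malloc (n * sizeof *big_array);

     // Randomly accessed per-thread state is what benefits from being
     // local, so allocate it through the hypothetical call.
     unsigned char *scratch = alloc_near_me (1 << 20);

     if (!big_array || !scratch)
       return EXIT_FAILURE;
     memset (big_array, 0, n * sizeof *big_array);
     memset (scratch, 0, 1 << 20);
     free (scratch);
     free (big_array);
     return EXIT_SUCCESS;
   }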