CS 111 lecture 1 Winter 2004 midterm answers

1. DMA won't help all that much, since DMA is essentially a
throughput-enhancing technique and throughput isn't important here.  It
might reduce latency somewhat, though, since the CPU can respond to
another device more quickly if this device is using DMA.  Memory-mapped
versus traditional I/O doesn't matter: memory-mapped I/O is simply a
convenience that simplifies the instruction set and has little or no
effect on performance.  Polling is a win, since it will reduce response
time.  If we use polling, it won't matter how many CPU registers are
available; but if we use interrupt-driven I/O, the extra CPU registers
will hurt response time because of the overhead of saving and restoring
them.

2. There's no difference in behavior if we consider only the stream of
output bytes.  However, there are performance and buffering issues.
Clearly the pipeline will be a bit slower.  Suppose simple_sh reads one
byte at a time, and suppose standard input is a slow device.  The user
would observe that commands were executed in "bursts" rather than
immediately after each newline was input, because "cat" would read 512
bytes and then send them all at once to "simple_sh", instead of
"simple_sh" reading the input bytes one at a time.  (In practice this
is not too much of an issue with Linux, since terminal input is
line-buffered.)

3. Have two processes each read from a pipe, each expecting the other
process to write to that pipe.  Here's some code (with only rudimentary
error reporting):

   #include <stdlib.h>
   #include <unistd.h>

   int
   main (void)
   {
     int fd[2];
     char c;
     if (pipe (fd) != 0)
       _exit (EXIT_FAILURE);
     if (fork () < 0)
       _exit (EXIT_FAILURE);

     // Copy pipe input to pipe output.
     // This code is executed by both parent and child.
     // Neither process ever writes first, so both block in "read",
     // each waiting for the other: deadlock.
     for (;;)
       switch (read (fd[0], &c, 1))
         {
         case -1: _exit (EXIT_FAILURE);
         case 0: _exit (EXIT_SUCCESS);
         case 1:
           if (write (fd[1], &c, 1) != 1)
             _exit (EXIT_FAILURE);
         }
   }

One way out is to use non-blocking reads and to report an error if no
data are available (sketched below); this prevents the "no preemption"
condition.  Another is to modify each process to write before it reads;
this avoids the "circular waiting" condition.
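Here is a minimal sketch of the non-blocking variant (not part of the
original answer), assuming the usual fcntl/O_NONBLOCK interface;
instead of blocking forever on an empty pipe, each process reports an
error and exits:

   #include <fcntl.h>
   #include <stdlib.h>
   #include <unistd.h>

   int
   main (void)
   {
     int fd[2];
     char c;
     if (pipe (fd) != 0)
       _exit (EXIT_FAILURE);
     // Make reads from the pipe non-blocking.
     if (fcntl (fd[0], F_SETFL, O_NONBLOCK) != 0)
       _exit (EXIT_FAILURE);
     if (fork () < 0)
       _exit (EXIT_FAILURE);

     for (;;)
       switch (read (fd[0], &c, 1))
         {
         case -1:
           // With O_NONBLOCK an empty pipe makes "read" fail with
           // EAGAIN rather than block, so this error exit replaces
           // the deadlock.
           _exit (EXIT_FAILURE);
         case 0: _exit (EXIT_SUCCESS);
         case 1:
           if (write (fd[1], &c, 1) != 1)
             _exit (EXIT_FAILURE);
         }
   }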
4. Here are some possibilities:

a. The application executes the equivalent of the machine instructions
inside the implementation of the "unlink" system call (these typically
contain a "trap" instruction), without invoking "unlink" directly.  (A
sketch of this appears after the references at the end of these
answers.)

b. Some other function in the C library ("remove", say) calls "unlink",
and that call was already resolved to a machine address when the C
library itself was linked.

c. The application uses "fork" and "exec" to invoke some other program
that does the unlink.

d. The application used dup2 (or something similar) to redirect stderr
temporarily to some other file.

e. Some other thread, or a signal handler, could modify the file name
between the time it's printed and the time it's passed to "__unlink".

f. The file that stderr was redirected to was on a file system that was
temporarily full, so the writes to it failed.

5. If remote memory access differs only in performance (even when
atomic instructions are involved), then there should be few correctness
problems when running an SMP-oriented OS on Red Storm.  So a port to
Red Storm should be doable fairly quickly, though its initial
performance won't be the best.  It's not likely that a NUMA
architecture would do well with an arbitrary SMP-style application in
which a process contains many threads that access memory randomly.
Such an application won't map well to a NUMA machine: either the
threads will each get their own CPU, which means they'll access a lot
of nonlocal memory, or the threads will time-share a single CPU,
eliminating parallelism.  Note that big arrays that are accessed
linearly (e.g., in many simulations) need not be local, since AMD
claims that nonlocal throughput is about as good as local; only memory
that is accessed "randomly" would benefit from being local.

It might be tough to figure this all out automatically, so our best bet
may be to tell programmers that they'll get their best results if they
write their applications and threads so that they can be partitioned
easily into mostly-distinct memory areas, such that each thread
accesses (mostly) its own memory partition when it accesses memory at
random.  We may need to add a system call that asks the OS for
random-access heap memory intended to be "close" to the current thread;
the OS can remember these requests and schedule threads and memory
accordingly.  (A second sketch at the end of these answers illustrates
one possible shape for such an interface.)  The OS could balance usage
by migrating threads from busy CPUs to lightly loaded ones, and
similarly could migrate memory (using the virtual memory hardware to
remap addresses to the migrated memory).  It may also make sense to
keep multiple copies of some objects (e.g., the C library), one per
CPU, to improve performance.  All this will require extensive reworking
of the scheduler and memory manager.

PS.  The Red Storm architects looked at these issues and decided to use
GNU/Linux only in the roughly 300 service and I/O nodes, with a
special-purpose OS in the roughly 10,000 compute nodes.  For more about
the Red Storm architecture and why it does not use GNU/Linux as the
compute-node OS, please see:

   R. Brightwell et al., "A performance comparison of Linux and a
   lightweight kernel", Proc. IEEE International Conference on Cluster
   Computing (CLUSTER'03) (2003-12-01), p. 251.
   http://csdl.computer.org/comp/proceedings/cluster/2003/2066/00/20660251abs.htm
   http://www.csis.hku.hk/cluster2003/presentation/technical/4A-2.pdf
   http://ieeexplore.ieee.org/iel5/8878/28041/01253322.pdf (UCLA IP addresses)

$Id: midans.txt,v 1.2 2004/03/22 06:03:19 eggert Exp $
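The following sketch (not part of the original answers) illustrates
possibility 4a, assuming a GNU/Linux system that defines SYS_unlink:
the program removes a file by issuing the system-call trap through
syscall(), so the C library's "unlink" wrapper is never called by name.

   #include <stdio.h>
   #include <sys/syscall.h>
   #include <unistd.h>

   int
   main (int argc, char **argv)
   {
     if (argc != 2)
       return 1;
     // syscall issues the kernel trap itself, bypassing any wrapper,
     // breakpoint, or log message attached to the "unlink" function.
     if (syscall (SYS_unlink, argv[1]) != 0)
       {
         perror ("SYS_unlink");
         return 1;
       }
     return 0;
   }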
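And for answer 5, here is a rough sketch of how the proposed
"allocate memory close to me" interface might look to an application.
The name alloc_near_me is hypothetical, and it is stubbed out with
plain malloc so the example compiles; a real port would implement it as
the proposed system call.

   #include <stdlib.h>
   #include <string.h>

   // Hypothetical interface: ask the OS for heap memory that should be
   // placed "close" to the calling thread.  A real implementation
   // would also let the OS remember the request when scheduling
   // threads and memory; this stub just calls malloc.
   static void *
   alloc_near_me (size_t size)
   {
     return malloc (size);
   }

   int
   main (void)
   {
     // A big, linearly scanned array need not be local, since nonlocal
     // throughput is claimed to be about as good as local.
     size_t n = 1000 * 1000;
     double *big_array = malloc (n * sizeof *big_array);

     // Randomly accessed per-thread state is what benefits from being
     // local, so allocate it through the hypothetical call.
     unsigned char *scratch = alloc_near_me (1 << 20);

     if (!big_array || !scratch)
       return EXIT_FAILURE;
     memset (big_array, 0, n * sizeof *big_array);
     memset (scratch, 0, 1 << 20);
     free (scratch);
     free (big_array);
     return EXIT_SUCCESS;
   }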