CS 111: Lecture 10 Scribe Notes


Jeannie Nguyen & Emily Im
FILE SYSTEMS:
We want to have good performance from our file systems, but how do we measure how "good" our performance is?

Metrics for Performance:
  • Latency: The delay from a request to its response.
  • Throughput: The rate of request completion.
  • Utilization: The fraction of time that the I/O system is actually doing I/O.
        - We can maximize throughput by getting utilization as close to 100% as possible.
  • Capacity: The amount of data the file system can hold.
  • Transfer Rate: The rate at which data moves off the device (e.g., from disk into memory). This is a lower-level measurement.
  • Typically, better performance means low latency, high throughput, and high utilization.

    Our Example Device:
    Assume we have a device with the following characteristics:
  • CPU Cycle Time: 1 nanosecond
  • Programmed I/O (PIO) : 1 microsecond (1000 cycles)
  • Send command to device: 5 microseconds (5 PIOs)
  • Device Latency: 50 microseconds
  • Computations per buffer: 5 microseconds (5000 cycles)
  • Read 40 bytes from device: 40 microseconds (40 PIOs)

  • The "Naive" Method
    Now, let's consider trying to run the following code on this device.

      for(;;)
      {
          char buf[40];
          read 40 bytes in buf;
          compute(buf);
      }

    We break up the calculations of latency, throughput, and CPU utilization as follows:
      Latency (microseconds) = time to send command (5) + device time (50) + read time (40) + time to compute (5) = 100 microseconds
      Throughput = 1 / Latency = 10,000 requests/second
      CPU Utilization = time to compute / Latency = 5 / 100 = 0.05 = 5%
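
    As a concrete illustration of the pseudocode above, here is a minimal C sketch of the naive approach. The dev_* helpers are hypothetical stand-ins for the real programmed-I/O accesses, stubbed here with delays matching the figures above, so the comments show where each part of the 100-microsecond latency comes from.

      #include <unistd.h>

      /* Hypothetical PIO helpers, stubbed with delays that match the device
       * figures above (1 microsecond per programmed I/O operation). */
      static void dev_send_read_command(void) { usleep(5); }            /* 5 PIOs = 5 us          */
      static int  dev_ready(void)             { usleep(50); return 1; } /* ~50 us device latency  */
      static char dev_read_byte(void)         { usleep(1); return 'x'; }/* 1 PIO per byte         */
      static void compute(char *buf)          { (void) buf; usleep(5); }/* 5 us of computation    */

      int main(void)
      {
          for (;;) {
              char buf[40];

              dev_send_read_command();          /* 5 us                        */
              while (!dev_ready())              /* busy-wait on device: ~50 us */
                  continue;
              for (int i = 0; i < 40; i++)      /* 40 us: one PIO per byte     */
                  buf[i] = dev_read_byte();
              compute(buf);                     /* 5 us                        */
          }
      }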

    Batching
    The naive method above is simple, but its performance is poor. Rather than issuing one 40-byte request at a time, we can read and compute in batches, amortizing the fixed per-request overhead and increasing utilization.

      for(;;)
      {
          enum { BATCHSIZE = 21 };
          char buf[40 * BATCHSIZE];
          read 40 * BATCHSIZE bytes into buf;
          for (int i = 0; i < BATCHSIZE; i++)
                compute(buf + 40 * i);
      }

      Latency (microseconds) = time to send command (5) + device time (50) + read time (40*21) + time to compute (5*21) = 1000 microseconds
         -Note that this is the latency of the entire batch; the latency seen by individual requests is staggered.
         -Actual latency: first buffer (950) + ... + last buffer (5)
      Throughput = BATCHSIZE / Latency = 21 / 1000 microseconds = 21,000 40-byte requests/s
          -This is because each batch completes 21 requests rather than just one.
      CPU Utilization = 105 / 1000 = .105 = 10.5%
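
    To see why batching helps and where it stops helping, note that with a batch of n requests one pass costs 5 + 50 + 45n microseconds, of which 5n is computation. A small sketch of that arithmetic:

      #include <stdio.h>

      /* Per-batch cost in microseconds for a batch of n 40-byte requests:
       * 5 (send command) + 50 (device latency) + 40n (read) + 5n (compute). */
      static double batch_latency_us(int n)  { return 5 + 50 + 45.0 * n; }
      static double throughput_rps(int n)    { return n / batch_latency_us(n) * 1e6; }
      static double utilization(int n)       { return 5.0 * n / batch_latency_us(n); }

      int main(void)
      {
          int sizes[] = { 1, 21, 100, 1000 };
          for (int i = 0; i < 4; i++) {
              int n = sizes[i];
              printf("n=%4d  latency=%7.0f us  throughput=%8.0f req/s  util=%5.1f%%\n",
                     n, batch_latency_us(n), throughput_rps(n), 100 * utilization(n));
          }
          /* Utilization approaches 5/45 = 11.1% as n grows: batching only amortizes
           * the fixed 55 us overhead; it never overlaps I/O with computation.      */
          return 0;
      }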

    Interrupts
    We can further increase performance by using an interrupt handler (on top of batching if we like), so that I/O access and computation run in parallel: while one request waits on the device, the CPU can do other work.

      for(;;)
      {
          write command to device;
          while(!ready)
              continue;               // the interrupt handler sets ready
          read data from device;
          compute;
      }

      // interrupt-handling side, run while the main loop waits:
      do
      {
          wait for interrupt;         // CPU stalls (or runs other work)
          handle interrupt;           // sets ready
      } while(!ready);

      Latency (microseconds) = time to send command (5) + wait for interrupt (50) + handle interrupt (5) + check ready (1) + read (40) + compute (5) = 106 microseconds
      Throughput = 1 / 56 microseconds = 17,857 requests/s (the 50-microsecond device wait can overlap with other work, leaving 56 microseconds of unoverlapped time per request)
      CPU Utilization = 5 / 56 = 8.9%
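
    One way to picture the coordination between the main loop and the interrupt handler is the ready flag below. This is only a user-space analogy: a POSIX alarm signal stands in for the hardware interrupt, and the 50-microsecond alarm plays the role of the device latency.

      #include <signal.h>
      #include <stdio.h>
      #include <unistd.h>

      /* Set by the "interrupt handler" when the simulated device has data ready. */
      static volatile sig_atomic_t ready = 0;

      /* The interrupt handler (~5 us of work): acknowledge and mark data ready. */
      static void handle_interrupt(int sig) { (void) sig; ready = 1; }

      int main(void)
      {
          signal(SIGALRM, handle_interrupt);   /* register our "interrupt handler"  */

          ready = 0;
          ualarm(50, 0);    /* "write command to device": interrupt fires ~50 us later */

          while (!ready)    /* while waiting, the CPU could be running other processes */
              continue;

          printf("interrupt handled; now read 40 bytes (40 us) and compute (5 us)\n");
          return 0;
      }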

    Direct Memory Access (DMA)
    Using interrupts leads to some improvement, but the CPU still spends 40 microseconds per request copying bytes with programmed I/O, which becomes a bottleneck. We can avoid this bottleneck with direct memory access (DMA): the device controller writes the data into memory itself, so the CPU is free to compute while the transfer happens.

      for(;;)
      {
          issue a write command to the device;
          block until the interrupt arrives;
          handle the interrupt;
          check that the device is ready;
          compute;
      }

      Latency (microseconds) = block(50) + handler(5) + check(1) + compute(5) = 61 microseconds
      Throughput = 1 / 11 (microseconds) = 91,000 requests/s
      CPU Utilization = 5/11 (microseconds) = 45%
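
    A rough sketch of what "issue a write command" means with DMA: the driver hands the controller a buffer address and length, and the controller moves the data itself. The register layout, field names, and command bits below are purely hypothetical.

      #include <stdint.h>

      /* Hypothetical memory-mapped DMA controller registers; illustrative only. */
      struct dma_regs {
          volatile uint64_t buf_addr;   /* physical address of the destination buffer  */
          volatile uint32_t buf_len;    /* number of bytes the device should transfer  */
          volatile uint32_t command;    /* writing DMA_CMD_READ starts the transfer    */
          volatile uint32_t status;     /* DMA_DONE is set when the transfer completes */
      };
      enum { DMA_CMD_READ = 0x1, DMA_DONE = 0x1 };

      /* Issue one DMA read: the controller copies the data into RAM by itself,
       * so the CPU never spends the 40 us of per-byte PIO.                     */
      static void dma_read(struct dma_regs *dev, uint64_t buf_phys, uint32_t len)
      {
          dev->buf_addr = buf_phys;       /* tell the controller where to put the data */
          dev->buf_len  = len;
          dev->command  = DMA_CMD_READ;   /* device starts transferring on its own     */

          /* block until the interrupt (~50 us, during which other work runs),
           * handle the interrupt (~5 us), check (dev->status & DMA_DONE) (~1 us),
           * then compute on the buffer (~5 us).                                 */
      }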

    Direct Memory Access (DMA) with Polling
    Now, rather than blocking, we implement DMA using polling. This avoids the overhead of blocking and of handling interrupts. Polling, however, can eat up more CPU than blocking if the device is slow to become ready.

      for(;;)
      {
          issue a write command to the device;
          while(DMA slots not ready)
              schedule;                 // yield the CPU to another process, then re-check
          check that the device is ready;
          compute;
      }

      Latency (microseconds) = wait for DMA (50) + check ready (1) + compute (5) = 56 microseconds
      Throughput = 1 / 6 microseconds = 166,666 requests/s
      CPU Utilization = 5/6 = 83%
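
    The "schedule" step is essentially a yield: let another process run, then re-check the DMA status. A minimal sketch using POSIX sched_yield(), reusing the hypothetical register layout from the previous sketch:

      #include <sched.h>
      #include <stdint.h>

      /* Reuses the hypothetical struct dma_regs / DMA_CMD_READ / DMA_DONE above. */
      static void dma_poll_read(struct dma_regs *dev, uint64_t buf_phys, uint32_t len)
      {
          dev->buf_addr = buf_phys;
          dev->buf_len  = len;
          dev->command  = DMA_CMD_READ;

          /* No interrupt handler: give the CPU to someone else, then look again. */
          while (!(dev->status & DMA_DONE))
              sched_yield();

          /* check that the device is ready (~1 us), then compute (~5 us) */
      }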

    Method                 Latency (microseconds)   Throughput (1000s of requests/s)   CPU Utilization
    The "Naive" Method     100                      10                                 5%
    Batching               ~1000                    21                                 10.5%
    Interrupts             106                      18                                 8.9%
    DMA with Interrupts    61                       91                                 45%
    DMA with Polling       56                       167                                83%
    Review of Disk Performance
    Disk latency comes from:
    1. Seek time:
          a. Accelerating the read head toward the correct track
          b. Decelerating the read head as it gets close to the target track
    2. Rotational latency: waiting for the data to come under the head
    3. Reading the data as it rotates under the head into controller RAM
    4. Transferring the data from controller RAM over the bus to the desired location
    Typical Disk Performance
    Seagate Barracuda ES.2 1TB ("nearline storage")
    • 16 MiB cache
    • Average rotational latency: 4.16 ms
          - 7200 rpm = 120 rotations/s, so one full rotation takes 1/120 s ≈ 8.33 ms; on average we wait half a rotation, 8.33 / 2 ≈ 4.16 ms
    • Average read seek time: 8.5 ms
          - Track to track: 0.8 ms
    • Average write seek time: 9.5 ms
          - Track to track: 1.0 ms
    • 1287 Mb/s maximum internal transfer rate
    • 116 MB/s maximum sustained transfer rate
    • 3 Gb/s external transfer rate (SAS)
    • 12.5 W typical operating power
    • 9.0 W idle power
    • 1.2 million hours mean time between failures (MTBF)
    • 0.73% annualized failure rate (AFR, 24x7 operation)
    • 10^-15 nonrecoverable sector read failure rate
    Be wary of the average seek times and the average rotational latency: compared to the transfer rates, these have the potential to be large bottlenecks.
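
    To see how these figures combine, here is a rough back-of-the-envelope estimate for one random 8 KiB read from this drive (average read seek + average rotational latency + transfer at the sustained rate; the 8 KiB block size is just an illustrative choice):

      #include <stdio.h>

      int main(void)
      {
          double seek_ms       = 8.5;                      /* average read seek time       */
          double rotational_ms = 4.16;                     /* half of one 8.33 ms rotation */
          double transfer_ms   = 8192.0 / 116e6 * 1000.0;  /* 8 KiB at 116 MB/s sustained  */

          double total_ms = seek_ms + rotational_ms + transfer_ms;
          printf("random 8 KiB read: ~%.2f ms, ~%.0f reads/s, ~%.2f MB/s effective\n",
                 total_ms, 1000.0 / total_ms,
                 (8192.0 / 1e6) / (total_ms / 1000.0));
          return 0;
      }

    The result is roughly 12-13 ms per random read, i.e. well under 1 MB/s effective, versus 116 MB/s sequential; this gap is what the techniques below try to close.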

    Basic Ideas for Improving Performance

    Prefetching: Fetch data from the device and place it into a RAM cache before an application requests it. This can be done within the application itself (as in the batching example above), within the OS, or in the disk controller. (A small sketch of an application-level prefetch hint appears after this list.)
    Speculation: Preemptively do work in the hope that it will improve performance later.
          -e.g., move the disk arm to the middle of an idle drive, so the next seek is shorter on average.
    Batching (for reads): Read more data than was asked for in one device operation, so that several requests are satisfied by a single read (as in the batching example above).
    Batching (for writes): Cache the data to be written in the kernel, and let the buffer fill up before sending it to the disk.
          -Downside: the data is not yet on persistent storage.
    Dallying: Rather than writing immediately, wait a while first. This can save work when the file system is serving multiple applications: by grouping related requests together, we avoid moving the disk arm excessively (see the second sketch below).
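
    As mentioned above, prefetching can also be requested from within an application. A minimal sketch using the POSIX posix_fadvise() hint (the offsets and chunk size are whatever the application predicts it will need next):

      #include <fcntl.h>
      #include <unistd.h>

      /* Ask the kernel to start reading the next chunk of the file into its cache
       * now, so a later read() of that range will not have to wait on the disk. */
      static void prefetch_next_chunk(int fd, off_t next_offset, off_t chunk_size)
      {
          (void) posix_fadvise(fd, next_offset, chunk_size, POSIX_FADV_WILLNEED);
      }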

    While each method benefits performance in some way, they do have their own disadvantages. Speculation can hurt performance if, for example, it moves the arm to a spot that is actually further from the desired location. Prefetching can waste RAM or tie up the disk unnecessarily.
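
    Here is a minimal user-space sketch of write batching with dallying: small writes accumulate in a RAM buffer and only reach the device when the buffer fills (or when a periodic timer, omitted here, flushes it). The buffer size is an arbitrary illustrative choice; in a real system the kernel's buffer cache plays this role.

      #include <string.h>
      #include <unistd.h>

      enum { WRITE_BUF_SIZE = 64 * 1024 };   /* batch size: arbitrary illustrative choice */

      static char   write_buf[WRITE_BUF_SIZE];
      static size_t write_buf_used = 0;

      /* Push everything accumulated so far to the device in one large write. */
      static void flush_writes(int fd)
      {
          if (write_buf_used > 0) {
              (void) write(fd, write_buf, write_buf_used);  /* one device operation for many requests */
              write_buf_used = 0;
          }
      }

      /* Dallying: small writes only go into RAM; the disk sees them when the
       * buffer fills (or when a periodic timer calls flush_writes()).  Until
       * then the data is NOT on persistent storage -- the downside noted above. */
      static void buffered_write(int fd, const void *data, size_t len)
      {
          if (len >= WRITE_BUF_SIZE) {       /* too big to batch: flush and write directly */
              flush_writes(fd);
              (void) write(fd, data, len);
              return;
          }
          if (write_buf_used + len > WRITE_BUF_SIZE)
              flush_writes(fd);
          memcpy(write_buf + write_buf_used, data, len);
          write_buf_used += len;
      }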

    Locality of Reference

    Locality of Reference refers to patterns in data requests that can be used to predict future requests. We can exploit these patterns to increase efficiency.

    Spatial Locality: Predicts data accesses based on where the data is located. For instance, if we access block i, it is very likely that we will soon access block i + delta (for small delta).

    Temporal Locality: An access at time t implies there will likely be an access at time t + delta. Essentially, we can dally, wait for more requests, and then handle them all at once.

    There is a key assumption being made here: cache coherence. We assume our caches of disk data are accurate (consistent with what is actually on disk). However, maintaining this gets increasingly difficult as the number of CPUs grows.
