The following portion was edited by Vamsi Krishna Reddy Mannam
Example Drive:
WD Caviar SE16
7200 rpm
16 MB buffer
4.20 ms average rotational latency
8.33 ms rotation time
8.9 ms average read seek time (average seek time)
10.9 ms average write seek time (average seek time)
2.0 ms track-to-track seek (best seek time)
21.0 ms full stroke seek (worst seek time)
3.0 Gb/s transfer rate (between buffer and computer)
0.97 Gb/s (= 120 MB/s = 0.12 GB/s) sustained transfer rate (between disk and buffer)
Example: read an 8 KiB block, randomly chosen.
read seek = 8.9 ms
rotate = 4.20 ms
transfer = (8192 bytes)/(120,000,000 bytes/s) ≈ 0.07 ms
total = read seek + rotate + transfer ≈ 13.2 ms (i.e., time to execute ~13.2 million instructions, assuming roughly 1 ns per instruction)
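As a sanity check, here is a small C sketch of that arithmetic (the constants are copied from the drive specs above; 120e6 bytes/s is the buffer-to-disk rate):

    #include <stdio.h>

    int main(void)
    {
        double seek_ms   = 8.9;              /* average read seek            */
        double rotate_ms = 4.20;             /* average rotational latency   */
        double rate_Bps  = 120e6;            /* disk transfer rate, bytes/s  */
        double xfer_ms   = 8192 / rate_Bps * 1000.0;
        printf("transfer = %.2f ms, total = %.2f ms\n",
               xfer_ms, seek_ms + rotate_ms + xfer_ms);   /* ~0.07 ms, ~13.2 ms */
        return 0;
    }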
Example program:
for (;;) {
    char buf[8192];
    read(ifd, buf, sizeof(buf));
    compute(buf);
    write(ofd, buf, sizeof(buf));
}
Time for processing 8192 bytes in one iteration:
time for read(13.2 ms) + time for compute(1.0 ms) + time for write(15.2 ms) = 29.4 ms
Throughput = (8192 bytes)/(29.4 ms) ≈ 272 KiB/s
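For concreteness, here is a self-contained sketch of that loop (the file names "input"/"output" and the body of compute() are invented for illustration; the lecture only gives the skeleton above):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Stand-in for the lecture's compute(): any ~1 ms of work on the block. */
    static void compute(char *buf, ssize_t n)
    {
        for (ssize_t i = 0; i < n; i++)
            buf[i] ^= 0x5a;
    }

    int main(void)
    {
        int ifd = open("input", O_RDONLY);                     /* hypothetical names */
        int ofd = open("output", O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (ifd < 0 || ofd < 0)
            return EXIT_FAILURE;

        for (;;) {
            char buf[8192];
            ssize_t n = read(ifd, buf, sizeof buf);            /* ~13.2 ms if it hits the disk */
            if (n <= 0)
                break;
            compute(buf, n);                                   /* ~1.0 ms in the model above   */
            if (write(ofd, buf, n) != n)                       /* ~15.2 ms if it hits the disk */
                return EXIT_FAILURE;
        }
        return 0;
    }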
We need to tune the above code to improve throughput
We assume the following about disk read/writes while tuning
Spatial locality: if block i is accessed, nearby blocks (i+1 or i-1) are likely to be accessed soon
Temporal locality: if block i is accessed at time t, it is likely to be accessed again at times close to t
We use speculation to improve reads
SPECULATION
Improve performance by guessing what the application will do in future.
Assume we read 128 blocks at a time (128 x 8 KiB = 1 MiB, roughly one track)
Track read = seek (8.9 ms) + read the whole track in one full rotation (8.33 ms; since we want the entire track, rotational latency and transfer together cost exactly one rotation) = 17.23 ms
Time for first Iteration of the loop = read(17.23) + compute(1.0) + write(15.2) = 33.43 ms
No more reads need to be done for the next iterations
Time for next iteration = read(0) + compute(1.0) + write(15.2) = 16.2 ms
Average time for an iteration = (33.43 + 127 x 16.2)/128 = 16.33 ms
Throughput = 8192 bytes / 16.33 ms = 490 KiB/s
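A minimal sketch of this speculation in user code, under the assumptions above: fetch 128 consecutive blocks (1 MiB, roughly one track) in a single read() and hand them out one block at a time. The helper name read_block() is made up, and short reads at end-of-file are ignored for brevity; in practice the OS buffer cache's read-ahead provides the same effect.

    #include <string.h>
    #include <unistd.h>

    #define BLOCK   8192
    #define NBLOCKS 128                    /* ~1 track */

    static char track[NBLOCKS * BLOCK];    /* prefetch buffer                */
    static int  next_block = NBLOCKS;      /* index of next unconsumed block */

    /* Fill buf (8 KiB) with the next block, reading a whole track's worth
       from fd only once every NBLOCKS calls.  Returns 1 on success, 0 at EOF. */
    static int read_block(int fd, char *buf)
    {
        if (next_block == NBLOCKS) {                 /* buffer exhausted: speculate */
            if (read(fd, track, sizeof track) <= 0)
                return 0;
            next_block = 0;
        }
        memcpy(buf, &track[BLOCK * next_block++], BLOCK);
        return 1;
    }

The read() call in the loop above then becomes read_block(ifd, buf).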
We use dallying + batching to improve writes
DALLYING
Delay doing an operation in the hope of improving performance
BATCHING
Coalesce adjacent requests that can be done more efficiently together
Assume we write 128 blocks (~1 track) at a time, i.e., dally until we have an entire track's worth to write (while still improving reads with the speculation method above).
Time for first iteration = read seek (8.9) + read transfer (8.33) + compute (1.0) + write (0.0) = 18.23 ms
Time for last iteration = read(0.0) + compute(1.0) + write seek(10.9) + rotational latency and write(8.33) = 20.23 ms
Time for iterations 2 to 127 = 1 ms each iteration for compute
Average time of an iteration = (18.23 + 20.23 + 126 x 1.0) / 128 = 1.28 ms
Throughput = 8192 bytes / 1.28 ms = 6250 KiB/s
Further improvement can be achieved by increasing the size of the buffer (e.g., 256 blocks at a time). Each doubling roughly doubles the throughput,
up to some point, until a large fraction of the maximum disk transfer rate (120 MB/s) is reached.
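A matching sketch of dallying + batching on the write side, again assuming 128-block batches and an invented helper name write_block(); the write() call in the loop becomes write_block(ofd, buf), and the final flush of any partial batch when the loop ends is omitted for brevity.

    #include <string.h>
    #include <unistd.h>

    #define BLOCK   8192
    #define NBLOCKS 128                    /* ~1 track */

    static char pending[NBLOCKS * BLOCK];  /* staging buffer for dallied writes */
    static int  npending;                  /* blocks accumulated so far         */

    /* Batch up writes; only every NBLOCKS-th call actually touches the disk (fd). */
    static int write_block(int fd, const char *buf)
    {
        memcpy(&pending[BLOCK * npending++], buf, BLOCK);
        if (npending == NBLOCKS) {         /* batch full: one big sequential write */
            npending = 0;
            if (write(fd, pending, sizeof pending) != (ssize_t) sizeof pending)
                return -1;
        }
        return 0;
    }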
Now consider multiple threads running:
- We want to schedule their I/O requests
- Want high throughput and no starvation
Simplifying Assumptions:
- Disk is linear array with a cursor
- Cost of the I/O transfer itself is zero (only the distance the cursor moves matters)
Problem: we are given a set of requested block numbers b0, b1, ..., b(m-1), each in the range 0 to N-1 for an N-block disk, and we want to find an order
in which to serve the bi's that maximizes throughput and minimizes starvation.
The following portion was edited by Collin Nielsen:
FCFS (First Come First Serve):
- no starvation
- low throughput
SSTF (Shortest Seek Time First):
- always pick request closest to head location
- starvation
- highest throughput
- example: requests arrive for blocks 10, 11, 3, 10, 11, 11, 10, 11, ... (the request for block 3 starves as long as requests near the head keep arriving)
A more common approach:
SSTF + FCFS
- break input into chunks
- use SSTF within a chunk
- use FCFS for chunks overall
This can be explained as the "Elevator Algorithm", where requests are handled like an elevator going up and down
The algorithm is unfair: requests for blocks in the middle of the disk don't wait long, but ones near the edges do (the head sweeps past the middle at roughly regular intervals, while a just-missed request near an edge can wait nearly a full sweep cycle)
Solution for unfairness:
- put most-common data in the middle of the disk
- one-way elevator (a.k.a. circular elevator): the head sweeps in only one direction (e.g., always down, never up), then jumps back to the start; somewhat worse throughput, but fairer
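A small sketch of the one-way (circular) elevator under the linear-array model above: keep a cursor, serve the smallest pending block number at or past the cursor, and wrap around when nothing lies ahead. The request set in main() echoes the SSTF starvation example; all names here are illustrative.

    #include <stdio.h>

    /* Pick the next request for a one-way (circular) elevator.
       pending[] holds outstanding block numbers (-1 = already served);
       returns the index of the request to serve next. */
    static int pick_next(const int pending[], int n, int cursor)
    {
        int ahead = -1, wrap = -1;
        for (int i = 0; i < n; i++) {
            if (pending[i] < 0)
                continue;
            if (pending[i] >= cursor && (ahead < 0 || pending[i] < pending[ahead]))
                ahead = i;                /* closest request at or past the cursor */
            if (wrap < 0 || pending[i] < pending[wrap])
                wrap = i;                 /* smallest overall, used when we wrap   */
        }
        return ahead >= 0 ? ahead : wrap;
    }

    int main(void)
    {
        int pending[] = { 10, 11, 3, 10, 11 };   /* cf. the starvation example */
        int cursor = 10;                         /* head currently near block 10 */
        for (int served = 0; served < 5; served++) {
            int i = pick_next(pending, 5, cursor);
            printf("serve block %d\n", pending[i]);
            cursor = pending[i];
            pending[i] = -1;
        }
        return 0;
    }

Here block 3 is served once the sweep wraps around, rather than starving indefinitely as it could under pure SSTF with a continuing stream of requests near blocks 10 and 11.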
Two threads reading block #s:
A: 1, 2, 3, 4, 5
B: 100, 101, 102, 103, 104
Timeline (each thread thinks for ~1 ms between its requests):
A: request 1 ----think (1 ms)---- request 2 ----think---- ...
B: request 100 ----think (1 ms)---- request 101 ----think---- ...
While A is thinking, B's request is the only one pending (and vice versa), so the scheduler keeps seeking back and forth across the disk.
Order of requests satisfied: 1, 100, 2, 101, 3, 102, etc.
What we want: 1, 2, 3, 4, 5, 100, 101, 102, 103, 104 (achieved by dallying for a while after each request, in case the same thread asks for a nearby block next)
This is called anticipatory scheduling.
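A rough sketch of the core dallying decision in an anticipatory scheduler; the struct, the helper name should_dispatch(), and the thresholds are all invented for illustration.

    #include <stdio.h>

    struct request { int block; int thread; };

    /* Decide whether to dispatch `next` immediately or dally briefly.
       last_thread/last_block describe the request that just completed. */
    static int should_dispatch(const struct request *next,
                               int last_thread, int last_block)
    {
        int seek = next->block - last_block;
        if (seek < 0)
            seek = -seek;
        if (next->thread == last_thread)
            return 1;        /* same thread is continuing its sequential run */
        if (seek <= 1)
            return 1;        /* already adjacent: dallying cannot help       */
        return 0;            /* far-away request from another thread: wait a
                                little in case a nearby request shows up     */
    }

    int main(void)
    {
        struct request b = { 100, 'B' };   /* B wants block 100                     */
        struct request a = {   2, 'A' };   /* A will want block 2 after thinking    */
        /* We just finished serving A's request for block 1. */
        printf("serve B:100 now? %d (0 = dally)\n", should_dispatch(&b, 'A', 1));
        /* During the dally, A's request for block 2 arrives. */
        printf("serve A:2 now?   %d (1 = go)\n",    should_dispatch(&a, 'A', 1));
        return 0;
    }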
File System Robustness Terminology:
- errors: mistake you made when you designed/built your system.
- faults: defect in your hardware or software that can lead to failure if exercised
- failures: when your implementation exhibits incorrect behavior (runtime issue)
File System Goals:
- performance
- atomicity - actions are either done or not done at all
- durability - file system survives some failures in the underlying hardware
Performance + atomicity conflict
last time: careful write ordering, so file systems survive crashes.
disk scheduling reorders writes! (SSTF)
Simplest fix: FCFS (but terrible throughput)
Can we do better?
idea: keep the 4 invariants by requiring FCFS ordering for metadata writes (superblock, bitmap, inodes) and using SSTF for data writes
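One way to picture this idea, as a hedged sketch with invented names: keep metadata writes in strict arrival order (here simply served before any data), and pick among data writes by SSTF. This is just one possible policy, not necessarily the exact one intended in lecture.

    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct req { int block; int is_metadata; };

    /* Pick the next write: metadata strictly in arrival order (FCFS),
       data by SSTF relative to the current head position. */
    static int pick(const struct req r[], const int done[], int n, int head)
    {
        for (int i = 0; i < n; i++)               /* oldest pending metadata first */
            if (!done[i] && r[i].is_metadata)
                return i;
        int best = -1, bestdist = INT_MAX;        /* otherwise closest data block  */
        for (int i = 0; i < n; i++) {
            if (done[i] || r[i].is_metadata)
                continue;
            int d = abs(r[i].block - head);
            if (d < bestdist) { bestdist = d; best = i; }
        }
        return best;
    }

    int main(void)
    {
        struct req r[] = {
            { 5000, 1 },   /* inode update  (metadata) */
            {  100, 0 },   /* data block               */
            { 5001, 1 },   /* bitmap update (metadata) */
            {   90, 0 },   /* data block               */
        };
        int done[4] = {0}, head = 95;
        for (int k = 0; k < 4; k++) {
            int i = pick(r, done, 4, head);
            printf("write block %d (%s)\n", r[i].block,
                   r[i].is_metadata ? "metadata, FCFS" : "data, SSTF");
            head = r[i].block;
            done[i] = 1;
        }
        return 0;
    }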
example: emacs ^s (save)
the wrong way:
fd = open("f", ...);
write(fd, ...);   /* can crash here: "f" is your only copy, now partially overwritten */
write(fd, ...);
write(fd, ...);
close(fd);
The right way:
fd = open("f~", ...);
write(fd, ...);
write(fd, ...);
write(fd, ...);
close(fd);
rename("f~", "f");   /* atomic: readers see either the old "f" or the new one */
This points to:
The Golden Rule of Atomicity:
Never write on your only copy
Corollary: simple atomicity requires atomic building blocks.
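A fuller, hedged sketch of the save-via-rename pattern above (the helper name and file names are illustrative; the fsync() before rename() is a common extra step to make the new bytes durable before the old copy disappears, and short writes are ignored for brevity):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Write `len` bytes to file `name` atomically: build a temporary copy
       first, then rename() it over the original, so a crash leaves either
       the old version or the new one, never a partial mix. */
    static int save_atomically(const char *name, const char *tmp,
                               const char *buf, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);        /* the original `name` is untouched */
            return -1;
        }
        if (close(fd) != 0 || rename(tmp, name) != 0) {
            unlink(tmp);
            return -1;
        }
        return 0;               /* rename() is the single atomic commit point */
    }

Usage matching the lecture example: save_atomically("f", "f~", buf, n). This works only because rename() is itself an atomic building block supplied by the file system, which is exactly the corollary above.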