Lecture 14 Notes

by Alexander Nguyen and Philip Wu

The golden rule of atomicity

Never modify the only copy!

High-level implementation of an atomic operation

BEGIN
Pre-commit discipline (Non-atomic actions can occur.)
COMMIT (Action that is atomic at a lower level.)
Post-commit discipline (Non-atomic actions can occur.)
END

If the system fails after COMMIT but before the post-commit phase completes, it recovers by finishing the post-commit discipline. If it fails before COMMIT, recovery aborts the operation by discarding the pre-commit work. Recovery must be idempotent, so that if the system crashes again in the middle of recovery, recovery can simply be run again. Recovery must handle aborts as well as commits.

A system can achieve better performance by letting other accesses peek into the internals of the operation. A reader can get the old version from the main disk before commitment and the new version from a copy disk after commitment (but should not try during commitment). We can even give a reader the data from the source (instead of from disk). However, the reader cannot commit changes to the system until the prior commitment has finished. Also, such a system can have cascading aborts (OSP Sidebar 9-3).

How to implement atomic writes

For disks, we can assume single-sector writes are atomic. Why? The disk is assumed to have enough capacitance to finish the sector write in progress if a power failure occurs. For other devices, this may not be true: a sector can be scrambled by a write that is interrupted by a power outage. One solution to this problem is to assign three physical sectors to each data sector that is to have the all-or-nothing property (OSP 9.B.1). To service a request to write to a data sector, write to each of the three physical sectors in order, one at a time. To read from a data sector, compare the contents of the first two physical sectors. If their contents are identical, return that value. Otherwise, return the contents of the third physical sector (OSP Figure 9-6). The table below shows the states the three sectors pass through during one write.

Data state	1	2	3	4	5	6	7
Sector S1	old	bad	new	new	new	new	new
Sector S2	old	old	old	bad	new	new	new
Sector S3	old	old	old	old	old	bad	new

However, this design assumes that all three physical sectors are identical before writing. A previous failure can violate this assumption. To fix this bug, check and repair the sectors (by forcing them to be identical) before writing (OSP Figure 9-7).
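
The three-sector scheme, including the check-and-repair step, can be sketched as follows. This is a minimal in-memory simulation, not the OSP code: the names AtomicSector, BAD, and crash_on are hypothetical, and it assumes at most one sector is scrambled at a time.

```python
BAD = object()  # hypothetical marker for a sector scrambled by an interrupted write


class AtomicSector:
    """One logical sector backed by three physical sectors S1, S2, S3."""

    def __init__(self, initial):
        self.s = [initial, initial, initial]

    def check_and_repair(self):
        """Force the three sectors to be identical before a new write."""
        s1, s2, s3 = self.s
        if s1 == s2 == s3:
            return                  # states 1 and 7: nothing to repair
        if s1 == s2:
            self.s[2] = s1          # states 5 and 6: finish the interrupted write
        elif s2 == s3:
            self.s[0] = s2          # states 2 and 3: fall back to the old value
        else:
            self.s[1] = s1          # state 4: S1 is new, S2 is bad, S3 is old
            self.s[2] = s1

    def put(self, data, crash_on=None):
        """Write S1, S2, S3 in order. crash_on=i simulates losing power
        while writing sector i, which leaves that sector scrambled."""
        self.check_and_repair()
        for i in range(3):
            if crash_on == i:
                self.s[i] = BAD
                return
            self.s[i] = data

    def get(self):
        """Return S1 if S1 and S2 agree, otherwise S3."""
        s1, s2, s3 = self.s
        return s1 if s1 == s2 else s3
```

For example, after put("new", crash_on=0) on a sector holding "old", the sector is in state 2 and get() still returns "old". The next put repairs S1 before writing.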

The shadow copy strategy

To write a new version of a file atomically:

Write new version to a copy file system.
Write new version to the main file system.

Commit record approach

Write all blocks to a copy disk.
Write a commit record.
Write all blocks to the main disk.
Clear commit record.

When you reboot, look for commit records. If you find one, recover by finishing the post-commit discipline.
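
The four steps and the reboot-time recovery can be sketched like this. It is a toy model under stated assumptions: the "disk" is a Python dict with hypothetical keys "main", "copy", and "commit", and crash_before is a hypothetical parameter that stops the update just before the named step.

```python
def update(disk, new_blocks, crash_before=None):
    """Run the four steps in order; return False if a crash was simulated."""
    steps = [
        ("copy",    lambda: disk["copy"].update(new_blocks)),      # pre-commit
        ("commit",  lambda: disk.update(commit=True)),             # COMMIT
        ("install", lambda: disk["main"].update(disk["copy"])),    # post-commit
        ("clear",   lambda: (disk.update(commit=False),
                             disk["copy"].clear())),
    ]
    for name, action in steps:
        if name == crash_before:
            return False        # simulated crash: later steps never run
        action()
    return True


def recover(disk):
    """Run at reboot. Safe to run any number of times (idempotent)."""
    if disk["commit"]:
        # A commit record exists: finish the post-commit discipline.
        disk["main"].update(disk["copy"])
        disk["commit"] = False
    # Either the update never committed (abort) or it is now installed;
    # in both cases the copy is no longer needed.
    disk["copy"].clear()
```

A crash before the commit record is written leaves the main disk untouched, so recovery only discards the copy. A crash after it leaves enough information on the copy disk to redo the installation.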

Journaling

Instead of copying the whole file system, copy only the parts being changed. The journal provides atomicity; cell storage provides fast reads (see OSP Figure 9-17). Basic logging protocol: Log the update before installing it. Run a recovery procedure after a reboot.

For an in-memory database, the journal is on disk and cell storage is in RAM (OSP 9.C.2):

Reboots are fairly slow.
Everything else is fast.
The database chews up RAM.

Write-ahead logging protocol

Log all intended actions.
COMMIT.
Install new data into cell storage.

For write-ahead logging, use roll-forward (redo) recovery:

Scan log in increasing time order.
Apply the committed actions that you find; some may not have been installed yet, and re-applying the ones that were is harmless.
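
A minimal sketch of write-ahead logging with roll-forward recovery, assuming the log is a Python list of tuples and cell storage is a dict. The record layout, the function names, and the crash_before parameter are hypothetical. The recovery shown first collects the commit records and then scans forward; a single forward scan that buffers updates until their commit record appears would work as well.

```python
def wal_transaction(log, cells, txid, updates, crash_before=None):
    """Log all intended actions, COMMIT, then install into cell storage."""
    for key, value in updates.items():
        log.append(("update", txid, key, value))   # log the intent first
    if crash_before == "commit":
        return
    log.append(("commit", txid))                   # COMMIT
    if crash_before == "install":
        return
    for key, value in updates.items():
        cells[key] = value                         # install new data


def redo_recover(log, cells):
    """Roll forward: re-apply every committed update in time order."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:                                # increasing time order
        if rec[0] == "update" and rec[1] in committed:
            _, _, key, value = rec
            cells[key] = value                     # idempotent: safe to repeat
```

Updates from a transaction with no commit record are simply ignored, because write-ahead logging never touched cell storage on their behalf.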

Write-behind logging protocol

Log old data (already in file system).
Install new data into cell storage.
COMMIT.

For write-behind logging, use roll-back (undo) recovery:

Scan log backwards looking for uncommitted actions.
Write old data back to cell storage.
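
A matching sketch of write-behind logging with roll-back recovery, again with a list as the log and a dict as cell storage. All names are hypothetical, and every key is assumed to exist in cell storage already. One addition that goes beyond the steps above: after undoing a transaction, the sketch appends an "abort" record so that a later recovery does not undo the same transaction again on top of newer committed data.

```python
def write_behind_transaction(log, cells, txid, updates,
                             crash_before_commit=False):
    """Log old data, install new data in place, then COMMIT."""
    for key, value in updates.items():
        log.append(("old", txid, key, cells[key]))  # save the old value first
        cells[key] = value                          # then overwrite in place
    if crash_before_commit:
        return                                      # simulated crash
    log.append(("commit", txid))                    # COMMIT


def undo_recover(log, cells):
    """Roll back: scan the log backwards, restoring old data for
    transactions that neither committed nor were already rolled back."""
    finished = set()    # transactions with a commit or abort record
    undone = set()
    for rec in reversed(log):
        kind, txid = rec[0], rec[1]
        if kind in ("commit", "abort"):
            finished.add(txid)
        elif kind == "old" and txid not in finished:
            cells[rec[2]] = rec[3]                  # write old data back
            undone.add(txid)
    for txid in undone:
        log.append(("abort", txid))                 # assumption: see note above
```

Scanning backwards matters for two reasons: a transaction's commit record is seen before its data records, and if a transaction changed the same cell twice, the oldest value is the one restored last.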

Reliability

A fault-tolerant (FT) system can survive a failure without drawing notice from users. A highly available (HA) system can survive a failure, but the system's performance may be degraded until recovery.

Durability

A measure of the length of time a storage medium remembers data.

Magnetic disk fault modes (OSP 8.E.3.1)

Write failure: A write to a sector fails. Failure is detected and reported to the operating system or application.
Decay failure: A sector goes "bad." Subsequent accesses fail.

RAID: Redundant Array of Independent Disks

RAID 0: Striping (OSP 6.A.4): Divide data among multiple disks. Reads and writes can concurrently access multiple disks. Transfer rate multiplied! Failure rate multiplied!
RAID 1: Mirroring (OSP 8.E.3.6): Copy data to multiple drives. Writes must be done on each disk. Reads can concurrently access multiple disks to increase throughput.
RAID 4: Dedicated parity disk (OSP 8.D.1): Dedicate a single disk as a parity disk for the other disks. Reading from a disk is direct. Writing to a data disk requires accessing the data disk and the parity disk, so the parity disk can be a bottleneck.
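
The parity idea behind RAID 4 can be illustrated with byte-wise XOR. This is a sketch only: blocks are short Python bytes objects of equal length, and the function names are hypothetical. It shows why a single lost disk can be rebuilt from the others, and why a write to one data disk also touches the parity disk.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild the one missing data block from the survivors and the parity."""
    return xor_blocks(parity, *surviving_blocks)


def small_write(data, parity, disk, new_block):
    """Overwrite data[disk] and return the updated parity block.
    Only the target data disk and the parity disk are read and written."""
    new_parity = xor_blocks(parity, data[disk], new_block)
    data[disk] = new_block
    return new_parity
```

For example, with three data blocks and parity = xor_blocks(*data), calling reconstruct on any two of the data blocks plus the parity yields the third. Because every small write goes through small_write, all writes contend for the single parity disk.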

Suggestions for Further Reading

Randy H. Katz, Garth A. Gibson, and David A. Patterson. Disk System Architectures for High Performance Computing. Proceedings of the IEEE, Vol. 77, Iss. 12 (December, 1989), pages 1842-1857.