CS 111

Scribe Notes for Lecture 16 - 5/28/2008

By Allen Hair and Corey Bryan

Healthy Skepticism

"Don't trust the system completely because it can fail"

What Can Go Wrong?

Hardware Failures
People Failures
1. Screwups/Mistakes
2. Attacks

Hardware Failures

A Simple Example: Disk Failure

Consider a text editor's save function. How do we implement this?

The Simple Way:

open("f", O_WRONLY, ...); write(...); write(...); write(...); close(...);

If there is a power failure in between writes, you get a mixture of old and new versions!

The Golden Rule of Atomicity

Never write over your only copy!

Work Around Solution:

// Write new data to a temporary file open("/tmp/f", O_RDWR, ...); write(...); write(...); write(...); // Copy data from the temporary file to the intended file open("f", O_WRONLY, ...); read(...); write(...); read(...); write(...); read(...); write(...); close(...); close(...);

If the program crashes during the writes to the temporary file, then the old data remains
unchanged.
If the program crashes after the temporary file has been written, then we can get the
new data from the temp file

Another Problem: is write() atomic?

Consider: write(f, buf, 10000000)
This could not possibly be atomic! No disk write buffer is that large.

Can we implement an atomic write?

Multiple Solutions:

Get hardware support:
Writes to sectors (512 Bytes) are atomic
If we can't get hardware support:
1. We need to nail down hardware assumptions
  t0 ----------------------------> t
  Old ?? Garbage ?? New
2. Then write multiple copies
  
  Old Old
  
  ??? Old
  
  New Old
  
  New ???
  
  New New
  
  But wait! How do we know which value is correct after a crash and reboot?
  There is no tie breaker.
  Answer: We can't! We need the following:
  
  Old Old Old
  
  ??? Old Old
  
  New Old Old
  
  New ??? Old
  
  New New Old
  
  New New ???
  
  New New New
  
  Then the fixup after a reboot is as follows:
  If A==B==C then everything is correct
  If A==B then copy that value to C
  If B==C then backup that value to A
  Else Copy A to B and C
  
  This uses 3 times the space, but it is the best we can do if sector writes are not atomic.

If sector writes are atomic, can we implement atomic block writes?

Lampson-Sturgis Assumptions:

Writes may fail (e.g. bad media) but a later read can detect failed writes
Storage can spontaneously decay
Failures are rare

Using these assumptions they built a system with two hardware sectors for one user
sector. This is an example of a Fault Tolerant System

Fault Tolerant Systems

User does not notice performance effect of a single failure

- Expensive

Highly Available Systems

No single point of failure but performance might be affected after failure

+ Cheaper
- Not as robust

How do we implement this?

Idea #1: Commit Record

Indicate which changes are pending and then in the case of a crash, have the fsck utility
complete those changes after reboot

Example:

Write data to a temporary buffer (say blocks 100-109 on disk)
Wait for them to actually hit the disk
Write a commit record (to, say, block 110 (assume atomic block writes))
Copy blocks 100-109 from temp to actual file
Clear commit record

If crash occures before step 3, fsck does nothing and the write never happened.
If crash occures after step 3, fsck repeats steps 4 and 5 and the file is written.

This approach also survives crashes during reboots.

More Generally

Begin:

Pre-Commit Phase
(Can back out: crashes mean nothing has happened)

Commit: (or Abort)

Post-Commit Phase
(Transaction is done, you can also add steps to improve performance here)

Idea #2: Journal

From double entry bookkeeping: never erase anything

Keep a copy of mutated data on an alternate disk

Can be on the same disk, but it needs to be on a seperate part

Journal normally accessed sequentially

e.g.

Journal (optimized for writing)

Begin

Change

...

Change

Commit

End

File System "Cell Storage" (optimized for reading)

Recovery must be idempotent: Doing an operation twice must produce the same result
as doing it once

Cell storage could be in RAM!

"Main Memory Database"

Non-volatile is more common

Using a big (Write-Through) cache for performance

Two Kinds of Journals

Write Ahead Log

During Writes:	During Recovery:
Log your planned changes Log "commits" Install Changes	Scan forwards through log Look for uncleared commits Execute planned changes "Roll-forward" recovery

Write Ahead Logs work best if we're just blasting over data

Write Behind Log

During Writes:	During Recovery:
Log all OLD values Install all NEW values Commit	Scan backwards through log Look for uncommitted data Write old data back to disk "Roll-back" recovery

Write Behind Logs work best if old values are cached anyways

Media Faults

For High Utilization Disks:
Disk Failure Rate

RAID

Redundant Array of Independent Disks

RAID 0 - Concatenation

RAID 0 Diagram

Does not avoid faults
Striping doubles throughput

RAID 1 - Mirroring

RAID 1 Diagram

+ No Single Failure Point
- 2 Times the Cost

To Write to Disk:
Write to device 0 and to device 1

To Read from Disk
Read from device 0. If this fails, read from device 1

No fsck required.
RAID 1 Reads are faster (2 independent disk arms)

RAID 4 - Single Bit Parity

RAID 4 Diagram

Each bit in the parity disk is the XOR of the corresponding bits from each of the data disks.

To Write to Disk: Read the old value and the parity block
Calculate the new parity value
Write the block to the data block and write the new parity to the parity block

To Read from Disk: Read block you want.
If the block is bad, XOR all the other blocks and compare with the parity bit to find the
old value

Failure Rate of RAID 4 = Failure Rate of a Single Drive *
(Failure Rate of a Single Drive * Mean Time To Recovery)

RAID can be done in hardware (by a controller unit) or in software (by the OS)

Hardware is better for low end systems
Software is better for high end server farms. It is more customizable for a specific
environment and application.

Distributed Systems: Remote Procedure Calls (RPC)

Caller and Callee do not share address space (no pointer passing).
They also do not necessarily share architectures.
For example, the x86 is a little endian machine while a SPARC processor is a big endian machine.
Thus, it is necessary to choose a standard (They chose Big Endian)

The Marshalling Problem: (Also called serialization or pickling)

How do we efficiently convert an internal data structure into a bytestream to go over a wire?
See next lecture!

Old	Old	Old
???	Old	Old
New	Old	Old
New	???	Old
New	New	Old
New	New	???
New	New	New

Old	Old	Old
???	Old	Old
New	Old	Old
New	???	Old
New	New	Old
New	New	???
New	New	New

Old	Old	Old
???	Old	Old
New	Old	Old
New	???	Old
New	New	Old
New	New	???
New	New	New