CS 111 Lecture 16 Scribe Notes - Spring 2012

by Henish Shah & Dony George for a lecture by Professor Paul Eggert on May 29, 2012

Continuing from the discussion from the previous class about the consequences of power failure, this class started with a discussion about disk failure.

One possible way to address this issue is to use logging. Whenever there is any change to the file system, this change is logged in a journal. When the system crashes, while recovering data, the system can check the journal and see which operations were being performed at the time of the crash and take appropriate measures to ensure that there is no corruption or incomplete operation. However, this issue would not work in case of a disk failure, as the journal itself is present on the disk.

RAID

Redundant Array of Independent Disks (previously: Redundant Array of Inexpensive Disks)
The original goal behind the concept was to save money (e.g. it is much cheaper to buy 5 1TB disks in comparison with 1 5TB disk)
However, over time the main goal has changed to gaining reliability

RAID 0 - Concatenation

RAID 0

The only goal here is to save money. For e.g., it is usually cheaper to buy 5 1TB drives in comparison to 1 5TB drive.
There is no redundance, just a simulated big disk
Data can be arranged in one of the two methods:
1. Sequential Read
  - Here the data is stored sequentially
  - Performance of a virtual disk is about the same as that of a physical disk + small overheads
  - On failure, easier to recover
  - Easy to add new disks (while upgrading the system)
2. Striping
  - Contiguous data is split into several physical drives. Hence data can be read in parallel from independent drives
  - Performance can be as good as n times better (where n is the number of disks)
  - It is however not easy to grow the disk, as the addition of a new disk can cause the striping to be rearranged

RAID 1 - Mirroring

RAID 0

The basic idea of RAID1 is mirroring the disk so that the system would not fail even if one of the disks fail.
Every write operation necessiates a write operation on both the disks
The read operation can be performed on either of the disks. Hence read is possible in parallel.
Since the reads can be performed from either of the disks, the performance can be optimized by performing the read from the disk arm which is closer to the actual data, thus decreasing seek time and improving read time.

RAID 4 - Parity

RAID 0

In RAID4, one of the physical disks is used as a parity disk
The parity disk stores the XOR of all the other disks. This way if any disk fails, we can recover the data by simpling computing the XOR of the other disks.
Consider there are 5 disks, A, B,C, D and P, where P is the parity disk while the rest are normal disks. The parity disk, P is calculated as A^B^C^D. If C fails, we can regenerate it as C = A^B^D^P
Reads a re similar to RAID0
Writes are similar to RAID1, since data has to be written to two disks for each write.
Drawbacks:
- If one of the disk fails, care must be taken to make sure that the disk is replaced immediatiely. Else there is a risk of losing all the data in the system.
- Since the parity disk, has be updated everytime anything is changed, that disk becomes a hotspot for activity.

RAID 5 - Parity with striping

RAID 5

This is similar to RAID 4, except that the parity bit is also distributed across the disks. This deals with the problem mentioned in RAID4 where the parity disk has a higher chance of failure due to the high activity occurring there.

Typical Disk Failure Characteristics

Based on the probability of disk failure provided by Disk drive companies, the Mean Time To Failure (MTTF) of a typical disk is around 300,000 hours or 34 years. However in reality, this is about 5 years. After 5 years, it is cheaper to buy a new disk rather than use the old disk because of the power gains provided by the new disk.

RAID 5

The above figure shows the Probability Density Function (PDF) of a normal disk failing. Interestingly, there is a high probability of failing in the initial period due to manufacturing error. This is followed by a relatively long stable period. As time progresses, this curve starts getting steeper.

RAID 5

The above figure shows the Probability Density Function for a RAID 4 system installed in space failing. Basically, it is a RAID 4 system where the disks cannot be replaced even on failure. Similar to the previous figure, there is a high probability of failing in the initial period. However, this is lower than the probability for a single disk due to the redundancy introduced by the parity disk. The probability of failure as time progresses increases up to 4x times the rate of failure for a simple disk.

RAID 5

The above figure shows the Probability Density Function of failure for a RAID 4 system installed in in a typical server room. In this scenario, it is expected that a operator would replace the disk as soon as it fails. Thus, after the five year period, the failure rate doesn't go up as steeply as it does in the case where we don't replace disks.

RAID would however not make disk backups seem redundant since RAID does not protect against user errors which could corrupt the disk.

Distributed Systems and RPC

Remote Procedure Call (RPC)

Wikipedia - "In computer science, a remote procedure call (RPC) is an inter-process communication that allows a computer program to cause a subroutine or procedure to execute in another address space (commonly on another computer on a shared network) without the programmer explicitly coding the details for this remote interaction. That is, the programmer writes essentially the same code whether the subroutine is local to the executing program, or remote. When the software in question uses object-oriented principles, RPC is called remote invocation or remote method invocation."
Caller makes call like
- x = fft(buf, n);
Actual Implementation
- send(buf, n) to server and wait for response
Callee and Caller do not share address space.
- Thus, we do not get the efficieny gains of Call by Reference
Advantage
- Caller and Callee may be on different architectures (eg. ARM and SPARC)
Drawbacks
- Some machines may be little endian while others may be big endian.

RAID 0

This problem is solved by marshaling.
- Marshaling and Unmarshaling is often automated.
- Generate stubs or wrappers to do marshaling -> Protocol Description + API Description
- In the above example, we are using one of the common forms, XML, to send data to the Callee
Failure modes are different
- Callee cannot trash caller's memory and vice versa (Hard Modularity)
  - System calls do not give the same hard modularity as RPC
- Since on Network :
  - message can get lost
  - message can get corrupted
  - duplicate messages
  - network can go down or could be super slow
  - server can go down and/or be super slow
- What should a stub/wrapper do:
  - If corruption then resend
  - If no response then
    - Keep trying. This could be suitable for idempotent operations (At least once RPC).
    - Give up and return an error. More appropriate for transactional operations (At most once RPC).
    - However, the user wants Exactly once RPC
RPC Examples
Example 1
- Client sends : "GET /foo/bar.html HTTP \r\n"
- Server responds : "HTTP/1.1 200 OK \r\n ________(some more stuff)"
- There is a protocol called SOAP which is built upon this. It specfies the *some more stuff* field above.
Example 2
- Say a client want to draw a pixel, it can send the following requests to the server:
- Server responds with "OK"
- This works even if they are on the same machine.
- You could also have functions that work on higher level primitives like, say, fill_rectangle()
Perfomance Issues with RPCs

RAID 0

You could have very long delays as we can see in the diagram

- Ways to resolve performance issues of RPCs
  - Higher Level Primitives - We can fix that if we use a higher level primitive. For this we would need to come up with an API in which operations are of the right size.
  - Asynchronous RPC - We can issue a bunch of RPCs back to back and then wait for responses to all of them. The responses could be in any order, since there is no guarantee of ordering due to unpredictable network behavior, different time to process at server.
    - Advantage: Better Performance
    - Drawback: Can complicate caller, since order may not be preserved.
    - Example
      - OLD : draw_pixel(x, y, color);
      - NEW : request = draw_pixel(x, y, color);
        wait_for(request);
      - In this case, the client has to know about the notion of outstanding requests. Thus, the abstraction offered by RPCs is somewhat weakened, since the client is now aware of an RPC call being made.
  - Cache in Caller - For simple stuff, not for long complicated procedures.

NFS Protocol

NFS is a file system built atop HTTP.
NFS can be approximated to a UNIX file system on wheels.
Example Operations (fh - file handle. Unique handle for a file within a file system)
- Lookup(dirfh(int),name(string)) : Response -> fh + attr (size, owner, . . . etc)
- CREATE(dirfh,name, attr) : Response -> fh + attr
- REMOVE(dirfh, name) : Response -> status
- READ(fh,size, offset) : Response -> data
- WRITE(fh,size, offset, data) : Response -> status
Want NFS to be reliable even if the file server reboots.
Hence, we choose to implement it as a stateless server.
- When a client does a write, it has to wait for a response before continuing.
- We could reissue requests if the server crashes.
- However, NFS server cannot respond until the data hits the disks.
- NFS would be slow for writes.
To fix this problem, we cheat
- Use flash on server to store pending writes.
- Writes don't really wait for server responses. If a write fails a later close will fail.
Using fsync and fdatasync would be slow.
In general, clients will not see a consistent state. NFS by design doesn't have read/write consistency for performance reasons. Though, it does have consistency for open/close, sync, fdatasync.

References

RAID Images - Wikipedia/RAID