CS111 Scribe Notes for Lecture 17 (May 30th, 2013)

by Zach North

Media faults: the disk, SSD, or other storage medium itself dies.

We want reliability even in the face of crashes.

How do we accomplish this? The techniques we saw earlier were commit records and journaling.

The problem with both of those techniques is that, while they are good ways to deal with power failure, they aren't designed to deal with media failure.

The standard technique for dealing with media faults: redundancy.

RAID (redundant arrays of independent disks) was introduced in a famous 1988 Berkeley paper by Patterson, Gibson, and Katz.

The naming of RAID is a little confusing: the "I" originally stood for "inexpensive" and was later changed to "independent."

Some sample economics:

A 10 TB drive costs, say, $1500.
A 1 TB drive costs, say, $80.
If you could buy ten of the 1 TB drives and combine them somehow, that's 10 x $80 = $800 -- roughly half the price.

Using RAID, we can configure the drives so users see only one big drive.
The simplest configuration is concatenation: lay the drives out end to end. The Berkeley people wrote a special disk driver to do this, as sketched below.
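A minimal sketch of concatenation's address arithmetic (in C; the names and sizes are mine, not from the lecture): a logical block number maps to a (drive, block-within-drive) pair by dividing past each drive's capacity.

    /* Concatenation: drives laid out end to end.
       Illustrative sketch; sizes are made up. */
    #define BLOCKS_PER_DRIVE 1000000L   /* pretend each drive holds 1M blocks */

    /* Map a logical block number to a (drive, block-within-drive) pair. */
    void concat_map(long logical, int *drive, long *block)
    {
        *drive = (int)(logical / BLOCKS_PER_DRIVE);
        *block = logical % BLOCKS_PER_DRIVE;
    }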

But there is a performance problem with concatenation: because of locality of reference, accesses tend to cluster in one region of the address space, so much of the time one drive does all the work while the rest of the system sits idle.

Because of this they came up with a different way of "gluing" drives together: instead of laying the data out end to end, split it into blocks and distribute consecutive blocks across the different drives (block-level striping).
This gives much better performance, because multiple disk arms can run at once.
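A matching sketch for striping (again, illustrative names only): consecutive logical blocks rotate across the drives, so a long sequential access keeps every arm busy.

    /* Block-level striping: consecutive logical blocks go to
       consecutive drives, round robin. Illustrative sketch. */
    #define NDRIVES 10

    void stripe_map(long logical, int *drive, long *block)
    {
        *drive = (int)(logical % NDRIVES);   /* rotate across the drives */
        *block = logical / NDRIVES;          /* offset within each drive */
    }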

Another problem now: the reliability of the big combined drive is lower than the reliability of the small drives, because if any one of the small drives fails, the whole system fails.

Mirroring: all written data is written to two separate physical disks.
Each block gets written twice. This halves our available storage space but greatly improves reliability.
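A rough sketch of a mirrored write (assuming two open file descriptors, one per physical drive; error handling kept minimal):

    #include <sys/types.h>
    #include <unistd.h>

    /* Mirroring: every block is written to both drives; a later read
       can be served from either copy. Sketch only. */
    int mirror_write(int fd0, int fd1, const void *buf, size_t n, off_t off)
    {
        if (pwrite(fd0, buf, n, off) != (ssize_t)n)
            return -1;                  /* first copy failed */
        if (pwrite(fd1, buf, n, off) != (ssize_t)n)
            return -1;                  /* second copy failed */
        return 0;
    }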

Aside: there's no law saying the individual hard drives can't themselves be virtual drives...
So we can "layer" different schemes on top of each other: mirror at the bottom level, stripe the level above that, etc.
The different techniques are a toolkit for building a system that is large + reliable + performant, and can be "tuned" to get what you want.
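For example, a stripe-over-mirrors stack (RAID 1+0) composes the two mappings sketched above; the pair count here is illustrative:

    /* Layering: stripe across NPAIRS mirrored pairs (RAID 1+0).
       Each write then goes to both drives of the chosen pair. */
    #define NPAIRS 5

    void raid10_map(long logical, int *pair, long *block)
    {
        *pair  = (int)(logical % NPAIRS);   /* striping layer */
        *block = logical / NPAIRS;          /* same offset on both mirrors */
    }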

Different forms of RAID

RAID 0: concatenation or block-level striping across several drives, with no redundancy.
RAID 1: mirroring.
RAID 4: block-level striping across the data drives, plus one dedicated parity drive; each parity block is the XOR of the corresponding blocks on the data drives.
RAID 5: like RAID 4, but with the parity blocks spread across all the drives instead of kept on one dedicated drive.

If you're managing a RAID 4 system and a disk fails, you replace the disk. What happens then? The replacement's contents are rebuilt by XOR-ing together the corresponding blocks of all the surviving drives.
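A minimal sketch of that rebuild step (layout and names are illustrative):

    #include <stddef.h>
    #include <string.h>

    /* RAID 4 rebuild: since parity = XOR of all data blocks, a lost
       block equals the XOR of the matching blocks on every surviving
       drive (data and parity alike). */
    #define BLOCK_SIZE 4096

    void rebuild_block(unsigned char out[BLOCK_SIZE],
                       unsigned char survivors[][BLOCK_SIZE],
                       int nsurvivors)
    {
        memset(out, 0, BLOCK_SIZE);
        for (int i = 0; i < nsurvivors; i++)
            for (size_t j = 0; j < BLOCK_SIZE; j++)
                out[j] ^= survivors[i][j];
    }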

Disk failure rates are high initially (manufacturing defects), then low for a long period, then begin to rise again as drives wear out (the failure-rate curve looks like a bathtub).

RAID 4 systems have much better reliability than a single drive at low t and much worse reliability at high t.
If you have someone around to do repairs it's great, because you can keep replacing failed parts and stay in "low t."
But you wouldn't send a RAID system to Mars: with nobody around to replace drives, the chance that some drive has failed keeps growing with time and with the number of drives, and once a second drive dies the array's data is gone -- so eventually the array's failure rate exceeds that of an individual drive.
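As a rough illustration (the numbers here are invented, not from the lecture): if each drive independently fails within some period with probability p, then

    P(at least one of n drives fails) = 1 - (1 - p)^n

so with p = 0.1 and n = 10 that's 1 - 0.9^10 ≈ 0.65 -- about 65%, versus 10% for a single drive. Without repairs, a big array loses its first drive much sooner than a lone drive would fail.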

The overall reliability therefore depends on the human factor (how quickly failed drives get replaced) and on the recovery period: while a replacement is being rebuilt, a second failure loses data.

NFS

This is the example benchmark Prof. Eggert gave in class.

Details:

Sun ZFS Storage 7320 appliance

Throughput: 134,140 operations per second (average response time 1.51 ms).

This means a well-provisioned NFS server can be 4-5x faster than a local disk (!) and can handle a huge number of requests.
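A quick sanity check on that claim (assuming a random local-disk access costs roughly 7 ms of seek plus rotational delay -- my estimate, not a number from the lecture):

    7 ms / 1.51 ms ≈ 4.6

which is where a 4-5x figure comes from.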

NFS was discussed in more detail last lecture.

NFS Security

What can go wrong with a network file system?

For one thing, permissions problems -- what if we're reading a file and another user makes it unreadable?

Traditionally, the client kernel enforces permissions for NFS files, just as it does for regular local files.
But this points to a security problem: the server is trusting the client kernel. An attacker with a "bad" (modified) client can present a fake user id and get access to other users' files.
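To see why, here is roughly what a classic AUTH_SYS (a.k.a. AUTH_UNIX) RPC credential contains, written as a C struct (paraphrased from the ONC RPC spec; field names approximate). Nothing in it is verified -- the client simply asserts these values:

    /* Approximate contents of an AUTH_SYS (AUTH_UNIX) credential.
       The server has no way to check any of these fields. */
    struct authsys_parms {
        unsigned int stamp;          /* arbitrary client-chosen id */
        char        *machinename;    /* client's hostname, unchecked */
        unsigned int uid;            /* claimed user id */
        unsigned int gid;            /* claimed group id */
        unsigned int gids_len;       /* number of supplementary groups */
        unsigned int gids[16];       /* claimed supplementary groups */
    };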

There are a couple of solutions to this in the NFS world: restrict NFS to physically secure machines on a trusted network, so the client kernels can be trusted; or use authenticated RPC (e.g., Kerberos via RPCSEC_GSS, as in NFSv4), so the server verifies each user's identity instead of trusting a claimed uid.

Security

What's the difference between traditional security and computer security?
Well, for one, attacks via fraud are more of a problem than attacks via force.
DDoS attacks can take you offline, but at least your data isn't compromised.

Main forms of attack: attacks against privacy (unauthorized release of data), attacks against integrity (tampering with data), and attacks against service (denial of service).

We want a system that both 1. disallows unauthorized access and 2. allows authorized access.

How to test 1? Try fake users, obviously bad clients, etc... but you never really know whether you're safe -- often you only find out when your system is compromised.
How to test 2? A lot simpler: just make sure everyone can log in. People will tell you if they can't.

How to test against DDoS? Well... a lot of the time you don't. MyUCLA doesn't, because they take the attitude of "who would DDoS us?"

Which leads to the next point:
We have to think about threat modeling and classification.

Threats, ordered by severity:

General functions used for defense: authentication (verifying who is asking), authorization (deciding what they're allowed to do), integrity checks (detecting tampering), and auditing (keeping records so attacks can be detected after the fact).