CS111: Operating Systems
Lecture 17 12/2/2013
Trevor Humphreys

Media Faults (SSD, Disk, and other media failing)

A media fault occurs when the storage device itself (hard disk or SSD) fails.
We also want the system to stay reliable across crashes caused by power failure.

How we can accomplish this

The problem with both of these techniques is, while they are good ways to deal with the problem of power failure, they aren't designed to deal with true media failure.

The best technique for dealing with media faults is redundancy

RAID (Redundant Arrays of Independent Disks)

Some example economics:

A 10 TB drive costs, say, $1500.
A 1 TB drive costs, say, $80.
If you buy ten of the 1 TB drives and combine them, you get the same capacity for $800, roughly half the price.

Using RAID, we can configure multiple disks so users see only 1 drive.
This configuration is called concatenation. The Berkeley computer scientists invented a special disk driver to do this.

There is a performance problem with concatenation: because of locality of reference, accesses tend to cluster in one region of the data, so much of the time one drive does all the work while the rest of the system sits idle.

Because of this, they came up with a different way of "gluing" drives together. Instead of just laying out the data end to end, split it into pieces and spread them across the different drives. This is called block-level striping.
It gives much better performance because multiple disk arms can work at once.
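The difference between the two layouts can be sketched as a mapping from a logical block number to a (drive, offset) pair. This is only an illustration: the function names, the drive count, and the blocks-per-drive figure are assumptions, not details from the lecture.

```python
def concat_map(block, blocks_per_drive):
    """Concatenation: drive 0 holds the first blocks_per_drive blocks, drive 1 the next, and so on."""
    return block // blocks_per_drive, block % blocks_per_drive

def stripe_map(block, num_drives):
    """Block-level striping: consecutive blocks rotate across the drives."""
    return block % num_drives, block // num_drives

def drives_touched(mapping, blocks):
    """Return the sorted set of drives used by a run of logical blocks."""
    return sorted({mapping(b)[0] for b in blocks})

NUM_DRIVES = 10          # assumed array size
BLOCKS_PER_DRIVE = 1000  # assumed (tiny) drive capacity, for illustration only

run = range(40, 50)      # ten consecutive logical blocks
concat_drives = drives_touched(lambda b: concat_map(b, BLOCKS_PER_DRIVE), run)
stripe_drives = drives_touched(lambda b: stripe_map(b, NUM_DRIVES), run)

print(concat_drives)  # [0]: one drive does all the work
print(stripe_drives)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]: every drive participates
```

With concatenation the whole run lands on a single drive; with striping the same run is spread over all ten, so their arms can seek in parallel.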

A large problem: the combined "big" drive is less reliable than any one of the small drives it is built from. If any one of the small drives fails, the whole system fails.
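To see why, here is a rough calculation. It assumes drives fail independently and uses a made-up 2% yearly failure probability per drive; neither assumption comes from the lecture.

```python
def array_failure_probability(p_drive, num_drives):
    """Probability that at least one of num_drives independent drives fails."""
    return 1 - (1 - p_drive) ** num_drives

P_DRIVE = 0.02  # assumed yearly failure probability of one small drive

single = array_failure_probability(P_DRIVE, 1)
combined = array_failure_probability(P_DRIVE, 10)

print(f"one drive:  {single:.3f}")    # 0.020
print(f"ten drives: {combined:.3f}")  # 0.183
```

Under these assumptions, an array that dies when any one of its ten drives dies is roughly nine times as likely to fail in a year as a single drive.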

Mirroring: all data is written to two separate physical disks.
Each block gets written twice. This halves the available storage space but greatly improves reliability.
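A toy model of mirroring, with two in-memory dictionaries standing in for physical disks. The class and method names are hypothetical; a real driver works on block devices, not dictionaries.

```python
class MirroredPair:
    """Toy mirror: every write goes to both disks, reads use any surviving disk."""

    def __init__(self):
        self.disks = [{}, {}]          # stand-ins for two physical disks
        self.failed = [False, False]

    def write(self, block_no, data):
        # Each block is written twice, once per working disk.
        for i, disk in enumerate(self.disks):
            if not self.failed[i]:
                disk[block_no] = data

    def fail(self, i):
        # Simulate a media fault: the disk's contents are gone.
        self.failed[i] = True
        self.disks[i] = {}

    def read(self, block_no):
        for i, disk in enumerate(self.disks):
            if not self.failed[i]:
                return disk[block_no]
        raise IOError("both copies lost")

pair = MirroredPair()
pair.write(7, b"hello")
pair.fail(0)                 # one disk dies
recovered = pair.read(7)     # the data is still readable from the mirror
print(recovered)
```

The data survives any single disk failure; it is lost only if both disks of the pair fail.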

We can layer different schemes on top of each other. For example, mirror at the bottom level, then stripe at the level above that.
The different techniques form a toolkit for building a system that is large AND reliable AND quick.

Different types of RAID

What happens when you need to replace a disk in RAID 4?
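One way to approach this question, assuming the standard RAID 4 layout (several data disks plus one dedicated parity disk that holds the XOR of the data disks): the contents of the replacement disk can be rebuilt by XORing the corresponding blocks of all surviving disks. The sketch below uses short byte strings as stand-ins for disk blocks; the helper name is made up.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data disks (one block each) and the parity computed from them.
data_disks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_disks)

# Suppose disk 1 dies and is replaced with a blank drive.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # b'BBBB'
```

Rebuilding has to read every surviving disk in full, so the array runs without redundancy until the rebuild finishes.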

Disk failure rates: initially high (manufacturing defects), then low for a long period, then rising again.

RAID 4 systems have much better reliability early in their life and much worse reliability late in their life.
If you have someone around to do repairs, it's great, because you can keep replacing failed parts and stay in the low-failure-rate region.
However, the odds of drives failing increase exponentially.

The overall reliability depends on the human factor, and also the recovery period.
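A rough sketch of why the recovery period matters: once one disk has failed, the array is lost if a second disk fails before the repair completes. The model below assumes a constant, independent failure rate per drive and an invented MTBF figure; real drives follow the failure-rate curve described above, so treat the numbers as illustrative only.

```python
import math

def second_failure_probability(surviving_drives, mtbf_hours, recovery_hours):
    """Chance that at least one surviving drive fails before recovery finishes."""
    combined_rate = surviving_drives / mtbf_hours  # failures per hour across survivors
    return 1 - math.exp(-combined_rate * recovery_hours)

MTBF_HOURS = 500_000  # assumed mean time between failures for one drive
SURVIVORS = 4         # assumed: a 5-disk array with one disk already dead

p_one_day = second_failure_probability(SURVIVORS, MTBF_HOURS, 24)
p_one_month = second_failure_probability(SURVIVORS, MTBF_HOURS, 720)

print(f"repair within a day:   {p_one_day:.6f}")
print(f"repair within a month: {p_one_month:.6f}")
```

The longer a failed disk sits unreplaced, the larger the window in which a second failure destroys the data, which is where the human factor comes in.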

Network File Systems (NFS)

Ex:

Sun ZFS Storage 7320 appliance

Throughput: 134,140 operations per second (average response time 1.51 ms).

This means a well-built NFS server can respond 4-5x faster than a local disk and can handle a huge number of requests.

NFS Security

What can go wrong with a network file system?

For one thing, permissions problems -- what if we're reading a file and another user makes it unreadable?

Traditionally, the client kernel enforces permissions for NFS files, just as it does for regular files.
But this creates a security problem: the server must trust the client kernel. An attacker with a rogue client can present a fake user ID and gain access to other users' files.
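The problem can be illustrated with a toy model. This is not real NFS code: the file table, the request format, and the function name are all invented for the example. The only point is that the server checks permissions against whatever user ID the client claims.

```python
# Hypothetical server-side file table: one file owned by uid 1001.
FILES = {"/home/alice/diary.txt": {"owner_uid": 1001, "data": "secret"}}

def trusting_server_read(request):
    """Toy server that believes whatever uid the client sends."""
    entry = FILES[request["path"]]
    if request["uid"] != entry["owner_uid"]:
        raise PermissionError("not the owner")
    return entry["data"]

# An honest client reports the real uid of the requesting user (1002) and is refused.
honest = {"path": "/home/alice/diary.txt", "uid": 1002}
try:
    trusting_server_read(honest)
    honest_allowed = True
except PermissionError:
    honest_allowed = False

# A rogue client simply claims to be uid 1001 and gets the data.
forged = {"path": "/home/alice/diary.txt", "uid": 1001}
stolen = trusting_server_read(forged)

print(honest_allowed, stolen)
```

Nothing in the request proves who the user is, so the check is only as trustworthy as the client that filled in the uid field.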

There are a couple of solutions to this in the NFS world:

Security

What's the difference between traditional security and computer security?
Attacks via fraud are more of a problem than attacks via force.
DDoS attacks can take you offline, but at least they don't compromise your data.

Main forms of attack:

We want a system that disallows unauthorized access AND allows authorized access.

To keep out unauthorized users: test with fake users, obviously bad clients, etc. But you won't really know whether you are safe until your system is compromised.
To let in authorized users: this is a lot simpler. Just make sure everyone can log in; they will tell you if they can't.

How do you prepare against DDoS? Most of the time you don't. Ex: MyUCLA doesn't, because their attitude is "who would DDoS us?"

Next Idea:
We have to think about threat modelling and classification.

Threats, ordered by severity:

  1. Insiders
    • Most common form of breach: authorized users doing things they shouldn't.
  2. Social engineering (Mitnick)
    • Mitnick was a famous hacker who broke into systems by pretending to be a repairman.
    • Smooth talkers talking their way into systems is a big problem in practice.
  3. Network attacks
    • DDoS
    • Drive-by Downloads (browser vulnerabilities)
    • Viruses
    • Phishing
  4. Device attacks
    • USB virus
    • etc

General Functions used for Defence: