Lecture 17

By Thomas Berger

Dealing with Media Faults on NFS Servers

Media faults occur when blocks on a server's disk (HDD, SSD, etc.) become corrupt. When a crash occurs we need to ensure that our system remains reliable. This reliability depends on our disks' ability to retain the correct data. Common causes of hard disk crashes include corrupted files, power outages, and excessive heat. To address these problems, servers often have hard disk and battery backups. Although journaling file systems are designed to survive system crashes and power outages, they do not help when a media fault occurs, because journaling assumes that the disk holding the journal is itself functional.

A standard technique for dealing with media faults is redundancy. RAID (Redundant Array of Independent Disks, originally "Inexpensive Disks") is used to implement this redundancy. The motivation for the invention of RAID was to reduce the cost of increasing disk space.

RAID 0: Concatenation and Striping

Concatenation links a collection of disks together in a linear fashion. When writing to the virtual disk, the system will write to the first physical disk until it is full and then move on to the next physical disk. Concatenation performs poorly because nearby blocks all live on the same physical disk, so only one disk is working at a time.

Striping is an alternative to concatenation that writes data across all physical disks instead of sequentially writing to one. Striping increases performance since multiple segments of data can be accessed concurrently.
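The difference between the two layouts can be sketched as a mapping from a virtual block number to a (physical disk, block on that disk) pair. This is only an illustration: the function names, the one-block stripe unit, and the disk sizes are assumptions made for the example, not details from the lecture.

```python
def concat_map(block, disk_size):
    """Concatenation: fill disk 0 completely, then disk 1, and so on."""
    return block // disk_size, block % disk_size


def stripe_map(block, num_disks):
    """Striping (assumed one-block stripe unit): consecutive virtual
    blocks rotate across the physical disks."""
    return block % num_disks, block // num_disks


# Which disk holds virtual blocks 0..3, given 4 disks of 1000 blocks each?
concat_disks = [concat_map(b, disk_size=1000)[0] for b in range(4)]
stripe_disks = [stripe_map(b, num_disks=4)[0] for b in range(4)]
print(concat_disks)  # [0, 0, 0, 0]
print(stripe_disks)  # [0, 1, 2, 3]
```

With concatenation, a sequential read of four blocks queues up on one disk. With striping, the same read touches four different disks that can work at the same time.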

RAID 1: Mirroring

With mirroring, data is written identically to two physical drives, producing a mirrored set. You take each virtual drive and map it to multiple physical drives. Mirroring increases read performance since you can pick the disk whose disk arm is closer to the data that you're trying to obtain. Write operations are slower, but not 2x slower: the two writes proceed in parallel, so a mirrored write takes as long as the slower of the two. Mirroring costs twice as much and uses just about double the power.

RAID 4: Concatenation with a dedicated parity disk

A parity disk is a hard drive that provides fault tolerance. It is implemented using exclusive or (XOR). The XOR of all the data drives is written to the parity drive. If one of the drives fails, the XOR of the remaining drives (including the parity drive) is equal to the data of the failed drive. Therefore all we have to do is write the XOR of the remaining drives to a new drive and our system will be recovered. When a drive fails there is a recovery period, which is equal to the notification period + the replacement time + the time to copy the repaired blocks. If we try to access the failed disk during the recovery period, performance will be badly degraded, because every such access has to be reconstructed from the other drives. If another disk fails during this period then we are toast. The downside of RAID 4 is that the parity disk is a hotspot: every write must also update it.
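The XOR recovery argument can be checked with a few bytes. The following is a minimal sketch in which each "drive" is a three-byte string. The helper name `xor_blocks` and the sample data are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data drives and the parity drive computed from them.
data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
parity = xor_blocks(data)

# Pretend drive 1 failed: XOR the surviving data drives with the parity drive.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])  # True
```

This works because XOR is its own inverse: XORing the parity with every surviving data block cancels those blocks out and leaves only the missing one.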

RAID 5: Striping with a distributed parity disk

RAID 5 uses the same basic idea as RAID 4 but stripes the parity as well: it distributes parity along with data across all the disks, so no single disk is a hotspot. The downside of RAID 5 is that growing the array is difficult since you have to reorganize all of the parity stripes across the new disk.
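One way to picture distributed parity is a rule that picks a different parity disk for each stripe. The rotation below is just one possible layout, assumed for illustration. The lecture does not specify a particular rotation, and real implementations differ.

```python
def parity_disk(stripe, num_disks):
    """Disk holding the parity block for a stripe, moving back one disk
    per stripe (an assumed layout, not the only one in use)."""
    return (num_disks - 1 - stripe) % num_disks


layout4 = [parity_disk(s, 4) for s in range(6)]
layout5 = [parity_disk(s, 5) for s in range(6)]
print(layout4)  # [3, 2, 1, 0, 3, 2]
print(layout5)  # [4, 3, 2, 1, 0, 4]
```

Comparing the two lists shows why growing the array is painful: with a fifth disk, the parity block for each of these stripes lives on a different disk than before, so existing parity has to be moved or recomputed.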

Disk Failure Rates

Disk failure rates tend to be high early in a disk's life because of manufacturing defects. As a disk ages, failure rates increase again because accumulated use wears the disk out. One reason a server drive may cost more than a desktop drive is that manufacturers test the disk more before sale. The cumulative distribution function (CDF) gives the probability that a drive has failed by a given time. Eventually this probability reaches 1 since no disk can function forever.

The CDF of a RAID 4 system depends on how long the recovery period is. If the recovery period is short then the CDF of array failure will also be low, because a second disk has little time to fail before the first is replaced. If a RAID 4 system were sent to Mars, the CDF of disk failure would be an increasing linear function. This is because if one disk were to fail it would be impossible to replace it, so the array would never recover.
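The effect of the recovery period can be made concrete with a rough calculation. The sketch below assumes a constant per-disk failure rate (an exponential model), which simplifies away the changing failure rates described above. The function name, the 3% annual rate, and the five-disk array are hypothetical values chosen only for illustration.

```python
import math


def p_second_failure(num_disks, annual_failure_rate, recovery_hours):
    """Chance that at least one surviving disk fails before recovery ends,
    under the assumed constant-failure-rate model."""
    rate_per_hour = annual_failure_rate / (365 * 24)
    survivors = num_disks - 1
    return 1 - math.exp(-survivors * rate_per_hour * recovery_hours)


# Five-disk array, hypothetical 3% annual failure rate per disk.
one_day = p_second_failure(5, 0.03, recovery_hours=24)
one_month = p_second_failure(5, 0.03, recovery_hours=24 * 30)
print(f"1-day recovery:  {one_day:.6f}")
print(f"30-day recovery: {one_month:.6f}")
```

Under this model, the risk of losing the array grows with the length of the recovery window, which is why quick notification and replacement matter, and why an array that can never be repaired is in the worst position.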

NFS Security

With traditional NFS security, the server trusts the client kernel, which can cause security issues. An attacker could run a malicious kernel that masquerades as a trusted user.

Solutions to this security problem
  • Physical Protection - Contain the entire network in a protected environment. This approach is simple and provides the best performance.
  • Virtual Private Network (VPN) - This approach allows clients to connect to the server as if they were on the same local network, like the protected environment described above. VPNs are difficult to set up and add network latency.
  • Individual authentication - Each request from client to server must contain more information than just the user id. NFSv4 implements this technique using Kerberos tickets for authentication.

Introduction to Security

What is the difference between traditional and computer security? Traditional security sees attacks mainly by force, whereas computer security sees more attacks by fraud. There are still a few computer security attacks that rely on force. These include installing malicious hardware such as keyloggers or breaking open computer cases to access the BIOS.

Main Forms of Attack
  • Attacks against privacy - Unauthorized data release
  • Attacks against integrity - Tampering with data that you are not allowed to change
  • Attacks against service - Denial of Service Attacks (DoS)

In order to prevent these listed attacks you need a system that:

  • Disallows unauthorized access - This protects your system against attacks on privacy and integrity
  • Allows authorized access - This prevents denial of service attacks

In order to assure your system's security you need to test it. This is not an easy thing to do. Users are unlikely to report security issues or bugs, leaving it up to you to find security holes before a malicious user does. By using penetration testing tools, system administrators can test common exploits on their system. Testing the system under heavy load can also help protect against denial of service attacks.

Threat Modeling and Classification

In order to design a secure system you need to think about the system you are trying to build and who may be trying to break into it. Insider attacks need to be heavily considered because they happen all the time. You should always assume there is going to be a malicious local user on your system. You also need to be aware of potential social engineering attacks. Clever attackers can try to trick your staff into divulging confidential information. This information may seem harmless at the time but could be the key to the social engineer's attack. Systems are always exposed to network attacks, which may include denial of service attacks, viruses, phishing, or drive-by downloads. Device attacks (most commonly USB viruses) are also possible and can cause serious damage.

General Functions Used for Defense

  • Authentication - the process of proving you are who you say you are. An example technique is to set passwords for your system. You may also want to consider a two-factor authentication scheme (e.g. passwords + RSA key).
  • Integrity - detecting attempts to tamper with the data. In order to detect tampered data you can look at timestamps or manage checksums.
  • Authorization - keeping track of what people are allowed to do. An example would be maintaining an access control list.
  • Auditing - keeping track of changes to the system and who accessed what and when. Auditing tells you who performed the attack and what specifically they did. This is more of a repair mechanism. An example is to keep logs.
  • Efficiency - determining whether the security measures put an undue burden on the system.
  • Correctness - the system needs to implement the security setup correctly.
  • Monitoring and Maintenance - giving administrators the ability to control certain aspects of the system's security.
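As an illustration of the integrity item above, here is a minimal sketch of checksum-based tamper detection. It uses SHA-256 from Python's standard `hashlib` module. The helper names and the sample data are hypothetical, and the lecture does not prescribe a particular checksum.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hex SHA-256 digest, used here as the recorded checksum."""
    return hashlib.sha256(data).hexdigest()


def is_untampered(data: bytes, recorded: str) -> bool:
    """True if the data still matches the checksum recorded earlier."""
    return checksum(data) == recorded


original = b"balance=100"
recorded = checksum(original)  # stored when the data was known to be good

print(is_untampered(original, recorded))        # True
print(is_untampered(b"balance=900", recorded))  # False
```

A plain checksum only helps if the attacker cannot also rewrite the recorded value, so the checksums must be stored somewhere the attacker cannot modify, or replaced with a keyed checksum whose key the attacker does not know.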