CS111 Lecture Notes
Lecture 13 – Wednesday, February 27, 2008
By: Brian Hackel and Chris Hui
File System Robustness
Recall our rename("a", "b") function from last lecture. It just "zaps" a block in-place. It is "atomic." (The quotes are here because we still have to worry about the write to disk possibly failing, but otherwise it is mostly robust.)
Now consider a harder problem:
rename("d1/a", "d2/b");
Here we have to change two things on the disk: the new directory and the old directory.
But what if the power fails in between these two changes?
A schedule for the above rename action could be as follows:
1) read d2's directory contents into main memory
2) read d1's directory contents into main memory
3) clear the "a" entry out of main memory (RAM) in the d1 block
4) write the d1 block to disk
5) create the "b" entry in main memory (RAM) in the d2 block
6) write the d2 block to disk
But what can go wrong with this schedule?
Suppose we crash (power failure) between steps 4 and 5: on disk, the "a" entry is gone from d1 but the "b" entry was never written into d2. No directory entry points to the inode any more, yet its link count is still 1, so the file's data is unreachable and its blocks will never be reclaimed.
Suppose instead we had written d2's block first and crashed before writing d1's block: now we have two directory entries pointing to the same inode. So after reboot, say we unlink("d1/a"). Then the link count on the inode goes to zero, and we reclaim it along with its data blocks. So then if someone else creates a new file, it is given that inode (since it's free now), and "d2/b" points to that new file!
This leads us to the notion of file system correctness invariants (Boolean expressions that are always true).
File System Correctness Invariants:
1) Each block is used for exactly one purpose: boot block, superblock, bitmap block, inode block, indirect block, data block, or directory block.
2) All referenced blocks are initialized with appropriate data.
3) All referenced blocks are marked "used" in the bitmap.
4) All unreferenced blocks are marked "free" in the bitmap.
Consequences of violating the above (punishment for breaking the law):
1) Disaster – we can lose data and likely crash.
a. For example, if we put data in an indirect block and then try to look up data blocks via that same indirect block, we're likely going to be pointing at garbage and crash.
2) Disaster – we can lose data and likely crash.
a. For example, we could end up writing over some data that is already on the disk.
3) Disaster – we can have two files writing into the same data block because it was allocated twice (once for each file), since its bit was never switched to "used" in the bitmap.
4) Not a disaster, but an inefficiency – we can have a block that never gets used because the file system always reads it in the bitmap as "used"; the bit will never get switched to "free" even though it does not really represent a used block.
a. Thus, if we HAVE to give up one of the invariants, this (#4) would be it, and most file systems do. On a crash we might lose track of blocks, and they never get used again.
b. In Unix: the file system check program (fsck).
i. On reboot, it walks through the file system looking for unused blocks and repairs violations of #4 if any are found, by switching the appropriate bits in the bitmap to the "free" state.
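As a rough sketch of that repair pass (not the real fsck code), suppose we have already walked the file system and recorded which blocks are actually referenced; the names and layout below are made up for illustration:

    #include <stdbool.h>
    #include <stddef.h>

    #define NBLOCKS 1024

    /* 1 = marked "used" in the on-disk bitmap, 0 = marked "free".          */
    static bool bitmap[NBLOCKS];
    /* 1 = actually referenced by some inode, indirect block, or directory. */
    static bool referenced[NBLOCKS];

    /* Repair violations of invariant #4: blocks marked "used" that nothing
     * references have been leaked; give them back by marking them "free".  */
    static int fsck_reclaim_leaked_blocks(void)
    {
        int reclaimed = 0;
        for (size_t b = 0; b < NBLOCKS; b++) {
            if (bitmap[b] && !referenced[b]) {
                bitmap[b] = false;          /* flip the bit back to "free" */
                reclaimed++;
            }
        }
        return reclaimed;                   /* number of blocks recovered  */
    }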
Now we can revisit our schedule for rename() to make it satisfy the invariants: move steps 5 and 6 up to sit in between 2 and 3.
1) read d2's directory contents into main memory
2) read d1's directory contents into main memory
3) create the "b" entry in main memory (RAM) in the d2 block
4) write the d2 block to disk
5) clear the "a" entry out of main memory (RAM) in the d1 block
6) write the d1 block to disk
But now, in the case where we crash after new step 4 (writing d2's block to disk), we end up with two hard links to the file while the link count in the inode is 1. Thus if we remove the file after reboot, the link count goes to zero, the inode and its blocks get reclaimed, and we lose data.
To fix, we add new steps:
1) read d2's directory contents into main memory
2) read d1's directory contents into main memory
2.1) read the inode of file "a" into RAM
2.2) increment the link count of the inode in RAM
2.3) write the inode back to disk
3) create the "b" entry in main memory (RAM) in the d2 block
4) write the d2 block to disk
5) clear the "a" entry out of main memory (RAM) in the d1 block
6) write the d1 block to disk
6.1) decrement the link count of the inode in RAM
6.2) write the inode back to disk
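Here is a rough sketch of this whole sequence in C, using a toy in-memory "disk"; the helpers dir_add_entry/dir_clear_entry and the inode layout are hypothetical stand-ins, and a real implementation would also have to lock the directories and the inode:

    #include <string.h>

    #define BLKSIZE 512
    static char disk[64][BLKSIZE];                  /* toy "disk" blocks */

    static void read_block(int n, char *buf)        { memcpy(buf, disk[n], BLKSIZE); }
    static void write_block(int n, const char *buf) { memcpy(disk[n], buf, BLKSIZE); }

    struct inode { int link_count; int block[12]; };   /* toy on-disk inode */

    /* Hypothetical directory-entry editors; they modify an in-RAM block copy. */
    static void dir_add_entry(char *blk, const char *name, int inum) { (void)blk; (void)name; (void)inum; }
    static void dir_clear_entry(char *blk, const char *name)         { (void)blk; (void)name; }

    void robust_rename(int d1blk, int d2blk, int inoblk, int inum)
    {
        char d1[BLKSIZE], d2[BLKSIZE], ib[BLKSIZE];
        struct inode ino;

        read_block(d2blk, d2);                /* 1)   read d2's directory block   */
        read_block(d1blk, d1);                /* 2)   read d1's directory block   */

        read_block(inoblk, ib);               /* 2.1) read the file's inode block */
        memcpy(&ino, ib, sizeof ino);
        ino.link_count++;                     /* 2.2) bump link count in RAM      */
        memcpy(ib, &ino, sizeof ino);
        write_block(inoblk, ib);              /* 2.3) write the inode to disk     */

        dir_add_entry(d2, "b", inum);         /* 3)   add "b" entry in RAM        */
        write_block(d2blk, d2);               /* 4)   write d2's block to disk    */
        dir_clear_entry(d1, "a");             /* 5)   remove "a" entry in RAM     */
        write_block(d1blk, d1);               /* 6)   write d1's block to disk    */

        ino.link_count--;                     /* 6.1) drop link count in RAM      */
        memcpy(ib, &ino, sizeof ino);
        write_block(inoblk, ib);              /* 6.2) write the inode to disk     */
    }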
But now we have 4 writes to disk and not 2. Doesn't this cause MORE of a problem?
Yes, it is a lot slower, but consider what we've gained:
Consider the possible crash points in the new scheme: a crash between any pair of writes.
- If we crash after 2.3 and delete the file on reboot, we don't lose data, but robustness invariant #4 will be violated, because we'll have a block that nothing links to but that cannot be used again.
- If we crash after 4, we have two hard links to the same file and have basically done the equivalent of "ln a b" rather than our intended "mv a b". It won't cause a crash. It's not GOOD, but it won't cause a disaster either.
- If we crash after 6, it is the same as the crash after 2.3: one hard link, but the link count is 2.
- If we crash after 6.2, we're OK because our rename() operation has completed.
File System Robustness Terminology (conforming to the IEEE standard):
error: a mistake made by the designer (in the designer's head)
fault: a defect in hardware or software that may cause a problem
failure: the system behaves incorrectly (perhaps from an earthquake or a stray gamma ray)
File System Goals – to be robust in all three of the above cases (or at least to survive the most probable failures).
Ideally, we strive to make the system's failure rate lower than the failure rates of its constituent parts.
Durability: data should survive (limited) hardware failures.
Atomicity: changes are either made or not; all or nothing.
- our rename() schedule fails this, because a "rename" can turn into a "link"
Performance: high throughput, low latency (these are often competing goals, though).
Suppose an application wants to do a lot of writes (all of them just data writes). The operating system will batch these writes into a buffer, such that the application thinks the writes are done, but they have not hit the disk yet.
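For example, with ordinary POSIX calls, write() typically returns once the data is in the kernel's buffer cache, and only fsync() forces it out to the disk (the file name below is just for illustration):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[8192] = { 0 };                      /* one block of data */
        int fd = open("example.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;

        /* Returns once the data is in the OS's buffer cache; it may not
         * have reached the disk yet, even though the call "succeeded".   */
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
            return 1;

        /* Force the cached data (and metadata) out to the disk.          */
        if (fsync(fd) != 0)
            return 1;

        return close(fd);
    }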
Usually they are batched in disk order to minimize seeks, but this will mess with the robustness we worked so hard for in our rename() example!
Does this mean we cannot reorder writes at all? Can we do no optimization when writing to disk? Or can we do some sort of reordering and still maintain our robustness?
Suggestions:
A) Reorder only sequences of writes to data blocks.
Simple example: write(fd, buf, 16384)   // write 2 blocks
If we crash such that the second block was written but the first was not (batched writes to data blocks), is this OK? It depends on our correctness criteria. But we have already defined our four invariants, and with them, this strategy works out!
B) Is it OK to reorder bitmap block writes?
That's a great question for the FINAL EXAM! (think about it)
More general approach:
The OS keeps track, for each written block, of which reads/writes it depends on (keeping a dependency graph for writes), and the disk scheduler then checks and respects these dependencies.
This is the FreeBSD FFS approach and is used in Mac OS X, but it took a lot of work to get it there.
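A minimal sketch of the idea (not the actual FreeBSD soft-updates code): each pending write records which other writes must reach the disk before it, and the scheduler only issues writes whose dependencies have completed.

    #include <stdbool.h>
    #include <stddef.h>

    #define MAXDEPS 4

    struct pending_write {
        int blockno;                              /* block this write targets        */
        bool done;                                /* has it reached the disk?        */
        struct pending_write *deps[MAXDEPS];      /* writes that must complete first */
        size_t ndeps;
    };

    static bool deps_satisfied(const struct pending_write *w)
    {
        for (size_t i = 0; i < w->ndeps; i++)
            if (!w->deps[i]->done)
                return false;
        return true;
    }

    /* One pass of a toy scheduler: issue (here, just mark done) every write
     * whose dependencies have already hit the disk.  A real scheduler would
     * also sort the eligible writes into disk order to minimize seeks.      */
    static void schedule_pass(struct pending_write **queue, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (!queue[i]->done && deps_satisfied(queue[i]))
                queue[i]->done = true;   /* stand-in for the actual disk write */
    }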
Ideas for making the file system more robust:
1) Be careful about reordering writes.
2) Journaling: this is an idea taken from double-entry bookkeeping in accounting. Whenever a transaction is made, it is recorded in two places, not just one. The records are permanent; once something is written down it cannot be changed or erased. "Use ink, never erase." More about this idea later.
3) Duplicate writes / RAID (Redundant Array of Inexpensive Disks): use multiple hard drives to improve performance and robustness.
Ideas 1 and 2 protect against power failures. Idea 3 protects against a disk frying.
If we go with the journaling approach, our system will be made of 2 parts:
Journal: the journal is stored on a separate file system (it might even be on a different disk). It is append-only, and entries can only be accessed sequentially (no seeks). Before we execute a write (actually changing the cell storage), we first have to make 2 entries in the journal:
1) The first entry tells the system that a write is taking place and what it will do.
2) The second entry tells the system that we are committed to performing the write.
Cell storage: the normal file system that we expect. It contains data blocks, indirect blocks, inodes, bitmap blocks, etc.
If our system crashes, fsck is run to check/repair the file system. If fsck sees a write that we were committed to but never finished, it will update the cell data as if the write had been carried out.
Assumptions/requirements:
1) Replaying the journal must be idempotent (same results when it is done multiple times).
2) What's in the log (sketched in code below):
BEGIN
CHANGE – planned changes
CHANGE
...
OUTCOME – either a commit or an abort
END – all the changes are accomplished in cell storage
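A sketch of these record types and of logging one transaction, assuming a hypothetical append-only journal_append() primitive; note that the CHANGE records below carry complete new block contents rather than deltas, which is one way to keep replay idempotent:

    #include <string.h>

    #define BLKSIZE 512

    enum rec_type { REC_BEGIN, REC_CHANGE, REC_COMMIT, REC_ABORT, REC_END };

    struct journal_rec {
        enum rec_type type;
        int txn_id;                    /* which transaction this record belongs to  */
        int blockno;                   /* for REC_CHANGE: block being updated       */
        char newdata[BLKSIZE];         /* for REC_CHANGE: the complete new contents */
    };

    /* Hypothetical append-only primitive: sequentially adds one record to
     * the journal and does not return until the record is safely on disk. */
    static void journal_append(const struct journal_rec *r) { (void)r; }

    /* Log one transaction that plans to update nblocks blocks. */
    static void log_transaction(int txn, const int *blknos,
                                const char (*newdata)[BLKSIZE], int nblocks)
    {
        struct journal_rec r = { 0 };
        r.txn_id = txn;

        r.type = REC_BEGIN;                         /* BEGIN                   */
        journal_append(&r);

        for (int i = 0; i < nblocks; i++) {         /* CHANGE: planned changes */
            r.type = REC_CHANGE;
            r.blockno = blknos[i];
            memcpy(r.newdata, newdata[i], BLKSIZE);
            journal_append(&r);
        }

        r.type = REC_COMMIT;                        /* OUTCOME: commit         */
        journal_append(&r);

        /* ... here the new values are installed into cell storage ...         */

        r.type = REC_END;                           /* END: changes installed  */
        journal_append(&r);
    }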
Cache in the controller: it would be bad if writes were cached and then the system crashed. Possible solutions:
A) Don't use the cache for writes.
B) Use a nonvolatile cache (NVRAM): memory that lasts even when the power is off.
C) Use a capacitor to act as an emergency battery/UPS (uninterruptible power supply).
D) A smart controller that knows about read/write ordering.
Idea: keep cell storage in RAM instead of on disk. Since this takes up a lot of memory, we should only cache the parts we need (such as recent files).
- Pros: writes will be faster.
- Cons: reads will be slower. We will need a garbage collector, since the journal will otherwise run out of space. The garbage collector has to be incremental (garbage collection will have to take place while normal programs are running).
To implement a journal we can use either a write-ahead log or a write-behind log.
- Write-ahead (roll forward): the idea is that when the system crashes, we can still perform the writes we had planned to do but weren't able to. This is the idea that we have discussed so far.
1) Log all planned actions for a transaction
2) Commit
3) Install the new values into the database
If we run fsck with write-ahead, the system will scan backwards through the journal looking for writes that we committed to but that were not carried out. Then it will do a second scan, going forward and applying the new values to the cell data.
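A sketch of that two-pass recovery, assuming the journal has already been read into an in-memory array of records (the record layout matches the earlier sketch, and install_block() is a hypothetical cell-storage write):

    #include <stdbool.h>

    #define BLKSIZE 512
    #define MAXTXNS 128                 /* assumes transaction ids are 0..127 */

    enum rec_type { REC_BEGIN, REC_CHANGE, REC_COMMIT, REC_ABORT, REC_END };

    struct journal_rec {
        enum rec_type type;
        int txn_id;
        int blockno;
        char newdata[BLKSIZE];
    };

    /* Hypothetical write into cell storage. */
    static void install_block(int blockno, const char *data) { (void)blockno; (void)data; }

    /* Write-ahead (roll-forward) recovery over an in-memory copy of the log. */
    static void recover_write_ahead(const struct journal_rec *log, int n)
    {
        bool committed[MAXTXNS] = { false };
        bool ended[MAXTXNS]     = { false };

        /* Pass 1: scan backwards, finding transactions that committed but
         * never logged END (i.e., were never fully installed).             */
        for (int i = n - 1; i >= 0; i--) {
            if (log[i].type == REC_END)    ended[log[i].txn_id]     = true;
            if (log[i].type == REC_COMMIT) committed[log[i].txn_id] = true;
        }

        /* Pass 2: scan forwards, re-applying the new values of those
         * transactions to cell storage.                                    */
        for (int i = 0; i < n; i++)
            if (log[i].type == REC_CHANGE &&
                committed[log[i].txn_id] && !ended[log[i].txn_id])
                install_block(log[i].blockno, log[i].newdata);
    }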
- Write-behind (roll back): the idea is that when the system crashes, we restore the system back to its old state. We assume all cell data is persistent.
1) Log all old values
2) Install new values
3) Commit
If we run fsck with write-behind, the system will scan backwards through the journal looking for data that was installed but that we did not commit to. We will undo these changes by installing the old data back. Note that this requires only 1 pass.
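A corresponding sketch for the write-behind case: here the CHANGE records hold the old block contents, and a single backwards pass undoes anything whose transaction never committed (same hypothetical record layout and install_block() stub as above):

    #include <stdbool.h>

    #define BLKSIZE 512
    #define MAXTXNS 128                 /* assumes transaction ids are 0..127 */

    enum rec_type { REC_BEGIN, REC_CHANGE, REC_COMMIT, REC_ABORT, REC_END };

    struct journal_rec {
        enum rec_type type;
        int txn_id;
        int blockno;
        char olddata[BLKSIZE];          /* write-behind logs the OLD contents */
    };

    static void install_block(int blockno, const char *data) { (void)blockno; (void)data; }

    /* Write-behind (roll-back) recovery: one backwards pass. */
    static void recover_write_behind(const struct journal_rec *log, int n)
    {
        bool committed[MAXTXNS] = { false };

        for (int i = n - 1; i >= 0; i--) {
            if (log[i].type == REC_COMMIT)
                committed[log[i].txn_id] = true;
            else if (log[i].type == REC_CHANGE && !committed[log[i].txn_id])
                install_block(log[i].blockno, log[i].olddata);   /* undo */
        }
    }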
These ideas of write-ahead and write-behind can be mixed together.
RAID (Redundant Array of Inexpensive Disks) – this is the idea of using multiple hard drives to increase performance.
RAID 0 (striping): 2 hard drives connected to each other. The data is striped, so that the data blocks alternate between the 2 disks.
Pros: increased capacity, parallel I/O for faster performance.
Cons: less robust. Once a single disk fails, both disks are effectively lost, since the data is mixed across both.
    Disk 1 | Disk 2
      A1   |   A2
      A3   |   A4
      A5   |   A6
      A7   |   A8
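A small sketch of the striping arithmetic for this two-disk layout (0-based logical block numbers, and the two-disk case is assumed just for illustration): logical block b lives on disk b mod 2 at offset b / 2.

    struct location { int disk; int offset; };

    /* Map a logical block number onto a 2-disk RAID 0 stripe set: blocks
     * alternate between the drives, so (using 0-based numbers) blocks 0 and 1
     * sit at offset 0 of disk 0 and disk 1, blocks 2 and 3 at offset 1, etc. */
    static struct location raid0_map(int logical_block)
    {
        struct location loc;
        loc.disk   = logical_block % 2;   /* which drive holds the block   */
        loc.offset = logical_block / 2;   /* block index within that drive */
        return loc;
    }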
RAID 1 (mirroring): 2 hard drives connected to each other. Both drives contain the same data.
Pros: more robust. We have additional protection against crashes, since the other disk acts as a backup. Increased read speed.
Cons: slower write speed (we have to write to twice as many disks).
    Disk 1 | Disk 2
      A1   |   A1
      A2   |   A2
      A3   |   A3
      A4   |   A4
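And a minimal sketch of mirrored I/O on two toy in-memory "drives" (a real RAID 1 driver would also keep serving reads from the surviving drive after a failure):

    #include <string.h>

    #define BLKSIZE 512
    #define NBLOCKS 64
    static char drive0[NBLOCKS][BLKSIZE];
    static char drive1[NBLOCKS][BLKSIZE];

    /* RAID 1 write: every block goes to BOTH drives (hence the slower writes). */
    static void raid1_write(int blockno, const char *buf)
    {
        memcpy(drive0[blockno], buf, BLKSIZE);
        memcpy(drive1[blockno], buf, BLKSIZE);
    }

    /* RAID 1 read: either copy will do; alternating between drives lets two
     * reads proceed in parallel, which is where the read speedup comes from. */
    static void raid1_read(int blockno, char *buf)
    {
        char (*src)[BLKSIZE] = (blockno % 2 == 0) ? drive0 : drive1;
        memcpy(buf, src[blockno], BLKSIZE);
    }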