File Systems
File systems must satisfy the following criteria in order to operate effectively and efficiently:
Organization
Robustness
A file system is a hybrid data structure that lives partly in RAM and partly on disk
- 1. RAM contents go away on power outages → RAM is there only for performance
- 2. Disk is stable storage → what's on disk actually counts
Performance
File system performance is achieved largely by caching file data in RAM (virtual memory)
Commit Records
Concept - You are allowed to write data onto the disk, but it doesn't count until "Simon Says" it does. The basic outline is as follows:
- 1. Write out new contents of file into blocks
- Old contents located in blocks 0 - 999
- New contents located in blocks 1000 - 1999
- 2. When done, write a commit record to block 2000
- 3. Copy new file to old, clear commit record
Disk layout: | Old (0-999) | New (1000-1999) | CR (2000) | … |
- 4. We need to have a way to know which copy to choose if the power fails during the writing process…
- Commit Record (CR) = 1 → Use New Copy
- Commit Record (CR) = 0 → Use Old Copy
Note: Atomicity for a big write is built from many small atomic writes
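A minimal, self-contained sketch of this protocol in C. The in-memory "disk" array and the write_block() helper are stand-ins invented for illustration; the assumption we rely on is that a single-block write is atomic.

    #include <string.h>

    #define BLKSIZE 512
    #define OLD     0       /* old copy: blocks 0-999      */
    #define NEW     1000    /* new copy: blocks 1000-1999  */
    #define NBLK    1000
    #define CR      2000    /* commit record block         */

    static char disk[2001][BLKSIZE];   /* toy stand-in for the real disk */

    /* we assume a single-block write is atomic */
    static void write_block(int b, const void *data) { memcpy(disk[b], data, BLKSIZE); }

    void update_file(const char newdata[NBLK][BLKSIZE])
    {
        char one[BLKSIZE] = {1}, zero[BLKSIZE] = {0};

        for (int i = 0; i < NBLK; i++)        /* 1. write the new contents      */
            write_block(NEW + i, newdata[i]);
        write_block(CR, one);                 /* 2. "Simon says": set CR = 1    */
        for (int i = 0; i < NBLK; i++)        /* 3. copy new to old...          */
            write_block(OLD + i, disk[NEW + i]);
        write_block(CR, zero);                /*    ...then clear the CR        */
    }
    /* After a crash: CR = 1 -> use the new copy; CR = 0 -> use the old copy. */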
All-or-Nothing Operations
We create a pattern, from BEGIN to END, which looks atomic at the next higher level:
BEGIN
- 1. Pre-commit phase: Can back out of whatever change we wanted to make, leaving no trace to our caller
- 2. [If] ABORT: No-op at the higher level (we don't write the commit)
- 2. [Else] Commit Phase: Atomic at the lower level
- 3. Post-commit phase: Doing actions that don't affect whether the commit occurred. Can't back out, can do housekeeping or tuning
END
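A hedged sketch of that BEGIN…END shape in C; every helper name here (prepare_change, undo_preparation, write_commit_record, tidy_up) is hypothetical and simply marks a phase.

    #include <stdbool.h>

    bool prepare_change(void);        /* pre-commit work: must be reversible      */
    void undo_preparation(void);      /* back out, leaving no trace to the caller */
    void write_commit_record(void);   /* the single atomic lower-level step       */
    void tidy_up(void);               /* post-commit housekeeping / tuning        */

    int all_or_nothing(void)
    {   /* BEGIN */
        if (!prepare_change()) {      /* 1. pre-commit phase                      */
            undo_preparation();       /* 2. [If] ABORT: a no-op to our caller     */
            return -1;
        }
        write_commit_record();        /* 2. [Else] commit: atomic at lower level  */
        tidy_up();                    /* 3. post-commit: can't back out anymore   */
        return 0;
    }   /* END */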
Journaling
With journaling, we use a different method than with commit records, namely:
- Don't copy the whole file
- Just record the changes you intend to make
[Figure: journaling abstraction]
- Abstraction level: record only the change itself (e.g., the new value 498) rather than rewriting the whole array (A B C D 9 6 - - -)
- File system level: disk = cell data (the blocks) + a journal
- Journal layout: | used | BEGIN | planned changes | commit record | … |
- Pending changes live in the journal until they are installed into the cell data
- If we crash while writing the journal, on recovery the system looks at the journal to see what hasn't been applied to the cell data yet.
- The cache of cell data in RAM keeps track of what the blocks should look like, even if the on-disk copies haven't been updated yet.
Problems
Recovery could potentially take a long time (if we have a big journal).
- You can update the journal to say that committed actions have been ended
- On reboot, scan backwards through the un-ENDed actions, then replay forwards (sketched after this list)
We could lose power on reboot!
- The scan will still work as long as we can mark committed actions as done (ENDed)
What if the system does ABORT, then a power failure occurs?
- We can attack the problem, but we must be careful.
- In this method, the recovery procedure treats ABORT differently
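A rough sketch in C of that reboot scan. The journal_entry layout, MAXACT, and replay() are invented for illustration; a real journal stores the planned changes themselves.

    #include <stdbool.h>

    #define MAXACT 1024                 /* hypothetical cap on in-flight actions */

    enum kind { J_BEGIN, J_CHANGE, J_COMMIT, J_END };

    struct journal_entry {
        enum kind kind;
        int       action;               /* which action this entry belongs to */
        /* ... the planned change itself would live here ... */
    };

    void replay(const struct journal_entry *e);   /* install one planned change */

    void recover(const struct journal_entry *j, int n)
    {
        bool committed[MAXACT] = {false}, ended[MAXACT] = {false};

        /* scan backwards, noting which actions committed and which already ENDed */
        for (int i = n - 1; i >= 0; i--) {
            if (j[i].kind == J_COMMIT) committed[j[i].action] = true;
            if (j[i].kind == J_END)    ended[j[i].action]     = true;
        }
        /* then replay forwards the committed but un-ENDed changes */
        for (int i = 0; i < n; i++)
            if (j[i].kind == J_CHANGE && committed[j[i].action] && !ended[j[i].action])
                replay(&j[i]);
        /* finally, mark those actions ENDed; if power fails here, the scan simply
           runs again (ABORTed actions need the extra care noted above). */
    }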
Performance
- 1. Have 2 writes plus a seek instead of 1 write...OUCH
  - Partial solutions (methods of attack):
    - A. Cache like crazy; keep the disk arm over the journal, then write into cell data only when the system becomes idle [assuming the system becomes idle]
    - B. Put the journal on a different disk!
- 2. No cell data on disk - journal only!
  - Boot time will be relatively long, but this improves the performance of ordinary writes since no seeks are involved
- Commit record tells you where the root directory is
- Root tells you where subdirectories are…
- Assumes a LOT of RAM
Subfield:
- Main Memory Databases - keep the journal on disk, don't care about it; the actual data is in RAM → making the system really fast
- Networked Main Memory Databases - get redundant hosts, and thus have several different copies → improved performance
- 3. Journal Overflow (gets too big)
  - A. Use a circular buffer, since we don't care about the first part of the journal once it has been committed (a sketch follows this list)
  - B. Make it plenty long!
    - Stall writes if the journal is full [until the journal empties]
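A sketch of the circular-journal idea in C; the slot count, the record layout, and the stall convention are all made up for illustration.

    #include <stddef.h>

    #define SLOTS 4096                       /* hypothetical journal capacity */

    struct jrecord { int cell; int value; }; /* placeholder journal record    */

    struct journal {
        struct jrecord slot[SLOTS];
        size_t head;   /* next free slot                                 */
        size_t tail;   /* oldest record not yet installed into cell data */
    };

    /* returns 0 on success, -1 meaning "journal full: stall this write
       until older, already-committed records have been installed" */
    int journal_append(struct journal *j, struct jrecord r)
    {
        size_t next = (j->head + 1) % SLOTS;
        if (next == j->tail)
            return -1;                       /* full: caller must stall       */
        j->slot[j->head] = r;                /* slots behind tail are free to */
        j->head = next;                      /* be overwritten later          */
        return 0;
    }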
Journals - 2 Kinds
1. Write-Ahead Logs
  - BEGIN
  - Log all planned new values
  - Commit
  - Install new values into cell data
  - END
2. Write-Behind Logs
  - BEGIN
  - Log all old values
  - Install new values into cell data
  - Commit
  - END
Recovery after a crash:
1. Write-ahead logs use roll-forward recovery: scan backwards from the END, noting which actions never committed, then change direction and replay the committed actions forwards
2. Write-behind logs use rollback recovery: scan backwards only, writing back the logged old values of uncommitted actions as you go
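A side-by-side sketch in C of the two orderings above. The helpers (journal_begin, journal_log, journal_commit, journal_end, install, read_cell) are hypothetical stand-ins for journal appends and cell-data writes.

    void journal_begin(void);
    void journal_log(int cell, int value);   /* append a value to the journal */
    void journal_commit(void);
    void journal_end(void);
    void install(int cell, int value);       /* write into cell data on disk  */
    int  read_cell(int cell);

    /* Write-ahead: log the planned NEW value, commit, then install. */
    void write_ahead_update(int cell, int newval)
    {
        journal_begin();
        journal_log(cell, newval);
        journal_commit();
        install(cell, newval);
        journal_end();
    }

    /* Write-behind: log the OLD value, install, and only then commit. */
    void write_behind_update(int cell, int newval)
    {
        journal_begin();
        journal_log(cell, read_cell(cell));
        install(cell, newval);
        journal_commit();
        journal_end();
    }

The only difference is the order of operations: write-ahead commits before touching cell data (so recovery replays new values), while write-behind touches cell data first and commits last (so recovery rolls back to the logged old values).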
Performance…again
- General Ideas (we want reads to go faster)
- Application says:
      pread(fd, buf, size, location_in_file);
- Assume an RT-11 file system with contiguous files; with size = 4096 the application issues:
      pread(fd, buf, size, L1);
      pread(fd, buf, size, L2);
- The Operating System can do this:
      lseek(disk, L1 + file_start);
      read(disk, buf, size);        // disk → buf
      lseek(disk, L2 + file_start);
      read(disk, buf, size);
- Cost of each such read:
      + 8.9 ms average seek time
      + 8.33 ms rotational latency
      + ~0 ms to actually read the data
      = ~17.23 ms per read!!
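To see what ~17.23 ms per read costs overall, a rough worked example (the file size is made up): reading a 4 MB file in 4096-byte chunks takes 1024 such reads, roughly 1024 × 17.23 ms ≈ 17.6 seconds, spent almost entirely on seeks and rotation rather than on transferring data.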
- Run-Time Optimization wins over Compile-Time Optimization in Operating Systems
How To Make This Faster?
- Choose a buffer size equal to the size of a track, to get a whole track into RAM
- 1. Rewrite all your applications to know about disk geometry [Not portable!]
- 2. SPECULATION - guessing what the system will do in the near future. Often done for reads… the OS guesses where the next read will be and reads that data before being asked (while it's in the neighborhood, in the background); a sketch follows
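A toy sketch of read speculation in C. cache_fill() and the block size are hypothetical; a real OS would prefetch into its buffer cache asynchronously.

    #include <sys/types.h>
    #include <unistd.h>

    #define BLK 4096

    void cache_fill(int fd, off_t off, size_t len);   /* hypothetical: read ahead into the cache */

    ssize_t read_with_speculation(int fd, void *buf, size_t len, off_t off)
    {
        ssize_t n = pread(fd, buf, len, off);         /* the read that was actually asked for     */
        if (n > 0)
            cache_fill(fd, off + (off_t)len, BLK);    /* guess: the next read will be sequential  */
        return n;
    }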
- 3. BATCHING - at the low level we ask for big things (even if the user [application] asks for little things) [minimizes transaction overhead at the lower level]
  - Works for writes (often done at many levels)
  - Library call:
        putchar(x) → write(1, &buf, 512)
        // buffered in RAM, FAST (one write() per ~512 putchars)
        fflush(stdout)
        // we want the data sitting in the buffer to be written out now
        fsync(1)
        // take everything associated with fd 1 and make sure it actually hits disk, SLOW
  - getchar() may be called 8192 times but issues only one system call → read(0, buf, 8192) (a sketch follows)
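A simplified stand-in in C for what the library does (not stdio's real code): one read() refills an 8192-byte buffer, and every call after that is just a memory access.

    #include <unistd.h>

    static char    buf[8192];
    static ssize_t nbuf = 0, next = 0;

    int my_getchar(void)
    {
        if (next >= nbuf) {                    /* buffer exhausted: refill it */
            nbuf = read(0, buf, sizeof buf);   /* the ONE system call         */
            next = 0;
            if (nbuf <= 0)
                return -1;                     /* EOF or error                */
        }
        return (unsigned char)buf[next++];     /* cheap: no system call       */
    }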
- 4. DALLYING - puts the write off…
  - getchar() → read(…); the next getchar() waits for that read() to finish
  - We can increase the throughput of our system by scheduling something else in the meantime… multitasking!
  - This keeps the CPU constantly busy instead of wasting CPU cycles (you should always have one thing to do…)
Example
- 1.29 Gb/s max internal transfer rate for a read (1287 Mb/s)
- 0.93 Gb/s max sustained transfer rate for a read (slower because you have to seek from one track to the next)
- 3.0 Gb/s external transfer rate (speed at which the disk controller can talk to the bus)
- 1.2 million hours MTBF (mean time between failures)
- 0.73% AFR (annualized failure rate)
- 1 per 10^15: non-recoverable sector read failure rate
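Sanity check on the last two figures (taking a year as roughly 8,766 hours): AFR ≈ 8,766 / 1,200,000 ≈ 0.0073, i.e. about 0.73%, so the MTBF and AFR numbers are consistent with each other.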