File Systems
File systems must satisfy the following criteria in order to operate effectively and efficiently:
Organization
Robustness
A file system is a hybrid data structure that lives partly in RAM and partly on disk
- 1. RAM contents go away on power outages → RAM is there only for performance
- 2. Disk is stable storage → what's on disk actually counts
Performance
File system performance is achieved largely by caching file data in RAM (virtual memory)
Commit Records
Concept - You are allowed to write data onto the disk, but it doesn't count until "Simon Says" it does. The basic outline is as follows:
- 1. Write out new contents of file into blocks
- Old contents located in blocks 0 - 999
- New contents located in blocks 1000 - 1999
- 2. When done, write a commit record to block 2000
- 3. Copy new file to old, clear commit record
Disk layout: | Old (0-999) | New (1000-1999) | CR (2000) | … |
- 4. We need to have a way to know which copy to choose if the power fails during the writing process…
- Commit Record (CR) = 1 → Use New Copy
- Commit Record (CR) = 0 → Use Old Copy
Note: Atomicity for a big write is built from many small atomic writes
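A minimal, self-contained sketch of this protocol in C. The in-memory "disk" array and the write_block() helper are stand-ins invented for illustration; the assumption we rely on is that a single-block write is atomic.

    #include <string.h>

    #define BLKSIZE 512
    #define OLD     0       /* old copy: blocks 0-999      */
    #define NEW     1000    /* new copy: blocks 1000-1999  */
    #define NBLK    1000
    #define CR      2000    /* commit record block         */

    static char disk[2001][BLKSIZE];   /* toy stand-in for the real disk */

    /* we assume a single-block write is atomic */
    static void write_block(int b, const void *data) { memcpy(disk[b], data, BLKSIZE); }

    void update_file(const char newdata[NBLK][BLKSIZE])
    {
        char one[BLKSIZE] = {1}, zero[BLKSIZE] = {0};

        for (int i = 0; i < NBLK; i++)        /* 1. write the new contents      */
            write_block(NEW + i, newdata[i]);
        write_block(CR, one);                 /* 2. "Simon says": set CR = 1    */
        for (int i = 0; i < NBLK; i++)        /* 3. copy new to old...          */
            write_block(OLD + i, disk[NEW + i]);
        write_block(CR, zero);                /*    ...then clear the CR        */
    }
    /* After a crash: CR = 1 -> use the new copy; CR = 0 -> use the old copy. */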
All-or-Nothing Operations
We create a pattern, from BEGIN to END, which looks atomic at the next higher level:
BEGIN
- 1. Pre-commit phase: Can back out of whatever change we wanted to make, leaving no trace to our caller
- 2. [If] ABORT: No-op at the higher level (we don't write the commit)
- 2. [Else] Commit Phase: Atomic at the lower level
- 3. Post-commit phase: Doing actions that don't affect whether the commit occurred. Can't back out, can do housekeeping or tuning
END
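A hedged sketch of that BEGIN…END shape in C; every helper name here (prepare_change, undo_preparation, write_commit_record, tidy_up) is hypothetical and simply marks a phase.

    #include <stdbool.h>

    bool prepare_change(void);        /* pre-commit work: must be reversible      */
    void undo_preparation(void);      /* back out, leaving no trace to the caller */
    void write_commit_record(void);   /* the single atomic lower-level step       */
    void tidy_up(void);               /* post-commit housekeeping / tuning        */

    int all_or_nothing(void)
    {   /* BEGIN */
        if (!prepare_change()) {      /* 1. pre-commit phase                      */
            undo_preparation();       /* 2. [If] ABORT: a no-op to our caller     */
            return -1;
        }
        write_commit_record();        /* 2. [Else] commit: atomic at lower level  */
        tidy_up();                    /* 3. post-commit: can't back out anymore   */
        return 0;
    }   /* END */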
Journaling
With journaling, we use a different method than with commit records, namely:
- Don't copy the whole file
- Just record the changes you intend to make
[Figure: journaling abstraction]
- Abstraction level: record only the change itself (e.g., the new value 498) rather than rewriting the whole array (A B C D 9 6 - - -)
- File system level: disk = cell data (the blocks) + a journal
- Journal layout: | used | BEGIN | planned changes | commit record | … |
- Pending changes live in the journal until they are installed into the cell data
- If we crash while writing the journal, on recovery the system looks at the journal to see what hasn't been applied to the cell data yet.
- The cache of cell data in RAM keeps track of what the blocks should look like, even if the on-disk copies haven't been updated yet.
Problems
Recovery could potentially take a long time (if we have a big journal).
- You can update the journal to say that committed actions have been ended
- On reboot, scan backwards through the un-ENDed actions, then replay forwards (sketched after this list)
We could lose power on reboot!
- The scan will still work as long as we can mark committed actions as done (ENDed)
What if the system does ABORT, then a power failure occurs?
- We can attack the problem, but we must be careful.
- In this method, the recovery procedure treats ABORT differently
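A rough sketch in C of that reboot scan. The journal_entry layout, MAXACT, and replay() are invented for illustration; a real journal stores the planned changes themselves.

    #include <stdbool.h>

    #define MAXACT 1024                 /* hypothetical cap on in-flight actions */

    enum kind { J_BEGIN, J_CHANGE, J_COMMIT, J_END };

    struct journal_entry {
        enum kind kind;
        int       action;               /* which action this entry belongs to */
        /* ... the planned change itself would live here ... */
    };

    void replay(const struct journal_entry *e);   /* install one planned change */

    void recover(const struct journal_entry *j, int n)
    {
        bool committed[MAXACT] = {false}, ended[MAXACT] = {false};

        /* scan backwards, noting which actions committed and which already ENDed */
        for (int i = n - 1; i >= 0; i--) {
            if (j[i].kind == J_COMMIT) committed[j[i].action] = true;
            if (j[i].kind == J_END)    ended[j[i].action]     = true;
        }
        /* then replay forwards the committed but un-ENDed changes */
        for (int i = 0; i < n; i++)
            if (j[i].kind == J_CHANGE && committed[j[i].action] && !ended[j[i].action])
                replay(&j[i]);
        /* finally, mark those actions ENDed; if power fails here, the scan simply
           runs again (ABORTed actions need the extra care noted above). */
    }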
Performance
- 1. Have 2 writes plus a seek instead of 1 write...OUCH
  - Partial solutions (methods of attack):
    - A. Cache like crazy; keep the disk arm over the journal, then write into cell data only when the system becomes idle [assuming the system becomes idle]
    - B. Put the journal on a different disk!
- 2. No cell data on disk - journal only!
  - Boot time will be relatively long, but this improves the performance of ordinary writes since no seeks are involved
- Commit record tells you where the root directory is
- Root tells you where subdirectories are…
- Assumes a LOT of RAM
Subfield:
- Main Memory Databases - keep the journal on disk, don't care about it; the actual data is in RAM → making the system really fast
- Networked Main Memory Databases - get redundant hosts, and thus have several different copies → improved performance
- 3. Journal Overflow (gets too big)
  - A. Use a circular buffer, since we don't care about the first part of the journal once it has been committed (a sketch follows this list)
  - B. Make it plenty long!
    - Stall writes if the journal is full [until the journal empties]
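A sketch of the circular-journal idea in C; the slot count, the record layout, and the stall convention are all made up for illustration.

    #include <stddef.h>

    #define SLOTS 4096                       /* hypothetical journal capacity */

    struct jrecord { int cell; int value; }; /* placeholder journal record    */

    struct journal {
        struct jrecord slot[SLOTS];
        size_t head;   /* next free slot                                 */
        size_t tail;   /* oldest record not yet installed into cell data */
    };

    /* returns 0 on success, -1 meaning "journal full: stall this write
       until older, already-committed records have been installed" */
    int journal_append(struct journal *j, struct jrecord r)
    {
        size_t next = (j->head + 1) % SLOTS;
        if (next == j->tail)
            return -1;                       /* full: caller must stall       */
        j->slot[j->head] = r;                /* slots behind tail are free to */
        j->head = next;                      /* be overwritten later          */
        return 0;
    }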
Journals - 2 Kinds
1. Write-Ahead Logs
  - BEGIN
  - Log all planned new values
  - Commit
  - Install new values into cell data
  - END
2. Write-Behind Logs
  - BEGIN
  - Log all old values
  - Install new values into cell data
  - Commit
  - END
Recovery after a crash:
1. Write-ahead logs use roll-forward recovery: scan backwards from the END, noting which actions never committed, then change direction and replay the committed actions forwards
2. Write-behind logs use rollback recovery: scan backwards only, writing back the logged old values of uncommitted actions as you go
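A side-by-side sketch in C of the two orderings above. The helpers (journal_begin, journal_log, journal_commit, journal_end, install, read_cell) are hypothetical stand-ins for journal appends and cell-data writes.

    void journal_begin(void);
    void journal_log(int cell, int value);   /* append a value to the journal */
    void journal_commit(void);
    void journal_end(void);
    void install(int cell, int value);       /* write into cell data on disk  */
    int  read_cell(int cell);

    /* Write-ahead: log the planned NEW value, commit, then install. */
    void write_ahead_update(int cell, int newval)
    {
        journal_begin();
        journal_log(cell, newval);
        journal_commit();
        install(cell, newval);
        journal_end();
    }

    /* Write-behind: log the OLD value, install, and only then commit. */
    void write_behind_update(int cell, int newval)
    {
        journal_begin();
        journal_log(cell, read_cell(cell));
        install(cell, newval);
        journal_commit();
        journal_end();
    }

The only difference is the order of operations: write-ahead commits before touching cell data (so recovery replays new values), while write-behind touches cell data first and commits last (so recovery rolls back to the logged old values).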
Performance…again
- General Ideas (we want reads to go faster)
- Application says:
      pread(fd, buf, size, location_in_file);
- Assume an RT-11 file system with contiguous files; with size = 4096 the application issues:
      pread(fd, buf, size, L1);
      pread(fd, buf, size, L2);
- The Operating System can do this:
      lseek(disk, L1 + file_start);
      read(disk, buf, size);        // disk → buf
      lseek(disk, L2 + file_start);
      read(disk, buf, size);
- Cost of each such read:
      + 8.9 ms average seek time
      + 8.33 ms rotational latency
      + ~0 ms to actually read the data
      = ~17.23 ms per read!!
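To see what ~17.23 ms per read costs overall, a rough worked example (the file size is made up): reading a 4 MB file in 4096-byte chunks takes 1024 such reads, roughly 1024 × 17.23 ms ≈ 17.6 seconds, spent almost entirely on seeks and rotation rather than on transferring data.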
- Run-Time Optimization wins over Compile-Time Optimization in Operating Systems
How To Make This Faster?
- Choose a buffer size equal to the size of a track, to get a whole track into RAM
- 1. Rewrite all your applications to know about disk geometry [Not portable!]
- 2. SPECULATION - guessing what the system will do in the near future. Often done for reads… the OS guesses where the next read will be and reads that data before being asked (while it's in the neighborhood, in the background); a sketch follows
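A toy sketch of read speculation in C. cache_fill() and the block size are hypothetical; a real OS would prefetch into its buffer cache asynchronously.

    #include <sys/types.h>
    #include <unistd.h>

    #define BLK 4096

    void cache_fill(int fd, off_t off, size_t len);   /* hypothetical: read ahead into the cache */

    ssize_t read_with_speculation(int fd, void *buf, size_t len, off_t off)
    {
        ssize_t n = pread(fd, buf, len, off);         /* the read that was actually asked for     */
        if (n > 0)
            cache_fill(fd, off + (off_t)len, BLK);    /* guess: the next read will be sequential  */
        return n;
    }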
- 3. BATCHING - at the low level we ask for big things (even if the user [application] asks for little things) [minimizes transaction overhead at the lower level]
  - Works for writes (often done at many levels)
  - Library call:
        putchar(x) → write(1, &buf, 512)
        // buffered in RAM, FAST (one write() per ~512 putchars)
        fflush(stdout)
        // we want the data sitting in the buffer to be written out now
        fsync(1)
        // take everything associated with fd 1 and make sure it actually hits disk, SLOW
  - getchar() may be called 8192 times but issues only one system call → read(0, buf, 8192) (a sketch follows)
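A simplified stand-in in C for what the library does (not stdio's real code): one read() refills an 8192-byte buffer, and every call after that is just a memory access.

    #include <unistd.h>

    static char    buf[8192];
    static ssize_t nbuf = 0, next = 0;

    int my_getchar(void)
    {
        if (next >= nbuf) {                    /* buffer exhausted: refill it */
            nbuf = read(0, buf, sizeof buf);   /* the ONE system call         */
            next = 0;
            if (nbuf <= 0)
                return -1;                     /* EOF or error                */
        }
        return (unsigned char)buf[next++];     /* cheap: no system call       */
    }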
- 4. DALLYING - puts the write off…
  - getchar() → read(…); the next getchar() waits for that read() to finish
  - We can increase the throughput of our system by scheduling something else in the meantime… multitasking!
  - This keeps the CPU constantly busy instead of wasting CPU cycles (you should always have one thing to do…)
Example
- 1.29 Gb/s max internal transfer rate for a read (1287 Mb/s)
- 0.93 Gb/s max sustained transfer rate for a read (slower because you have to seek from one track to the next)
- 3.0 Gb/s external transfer rate (speed at which the disk controller can talk to the bus)
- 1.2 million hours MTBF (mean time between failures)
- 0.73% AFR (annualized failure rate)
- 1 per 10^15: non-recoverable sector read failure rate
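Sanity check on the last two figures (taking a year as roughly 8,766 hours): AFR ≈ 8,766 / 1,200,000 ≈ 0.0073, i.e. about 0.73%, so the MTBF and AFR numbers are consistent with each other.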