File Systems
				
				File systems have the following criteria that need to be satisfied in order to operate effectively and efficiently:
				
				Organization
					
				Robustness
					
					A Hybrid Data Structure is a file system that lives on RAM and in Disk
					
						- 1. RAM goes away on power outages → must be present for performance
 
						- 2. Disk is stable storage → what's on disk actually counts
 
					
				Performance
					
					File system performance is achieved through virtual memory
				
			
			Commit Records
				
				Concept - You are allowed to write data onto the disk, but it doesn't count until "Simon Says" it does. The basic outline is as follows:
				
					- 1. Write out new contents of file into blocks
						
							- Old contents located in blocks 0 - 999
 
							- New contents located in blocks 1000 - 1999]
 
						
					 
					- 2. When done, write a commit record to block 2000
 
					- 3. Copy new file to old, clear commit record
 
					- 
						
							
								| Old (0-999) | 
								New (1000-1999) | 
								CR (2000) | 
								… | 
							
						
					 
					- 4. We need to have a way to know which copy to choose if the power fails during the writing process…
						
							- Commit Record (CR) = 1 → Use New Copy
 
							- Commit Record (CR) = 0 → Use Old Copy
 
						
		
					 
				
				Note: Atomicity for a big write is built from many small atomic writes
				
				All-or-Nothing Operations
					We create a pattern, from BEGIN to END, which looks atomic at the next higher level:
BEGIN
						
							- 1. Pre-commit phase: Can back out of whatever change we wanted to make, leaving no trace to our caller
 
							- 2. [If] ABORT: No-op at the higher level (we don't write the commit)
 
							- 2. [Else] Commit Phase: Atomic at the lower level
 
							- 3. Post-commit phase: Doing actions that don't affect whether the commit occurred. Can't back out, can do housekeeping or tuning
 
						
					END
					
			Journaling
				
				When using journaling, we use a different method than in the commit records, namely:
				
					- Don't copy the whole file
 
					- Just record the changes you intend to make
 
					
ABSTRACTION 
					- 
						
							
								| 498 (record this only) | 
								A B C D 9 6 - - - | 
							
						
					 
					
FILESYSTEM 
					- 
						
							
								| Blocks | 
								+ | 
								Journal | 
							
							
								| 
									
								 | 
								
									
										
											| Used | 
											BEGIN | 
											Planned Changes | 
											Commit record | 
											… | 
										 
									 
								 | 
							
							
								| Cell Data | 
								Pending Changes | 
							
						
					 
					- If we crash while writing the Journal, the system will look at the journal to see what hasn't been changed yet.
 
					- 
						
					
 
					- The cache of cell data in RAM keeps track of what blocks should look like even if they haven't been looked at.
 
				
			
			Problems
				
				Could potentially take a long time (if we have a big journal).
				
					- You can update the journal to say that committed actions have been ended
 
					- On reboot, scan backwards through the un-ENDed actions, then replay forwards
 
				
				We could lose power on reboot!
				
					- The scan will still work as long as we can mark commit as done
 
				
			
				What if the system does ABORT, then a power failure occurs?
				
					- We can attack the problem, but we must be careful.
 
					- In this method, the recovery procedure treats ABORT differently
 
				
				
			Performance
				
				
					- 1. Have 2 writes plus a seek instead of 1 write...OUCH
 
					- 
						
							- Partial solutions (methods of attack)
 
							- 
								
									- 
										A. Cache like crazy; keep the disk arm over the journal, then 
										write into cell data only when system becomes idle [Assuming
										system becomes idle]
									
 
									- B. Put the journal on a different disk!
 
								
							 
						
					 
					- 2. No cell data on disk - journal only!
 
					- 
						
							- 
								Boot time will be relatively long, but improves performance of ordinary
								writes since no seeks are involved
							
 
							- Commit record tells you where the root directory is
 
							- Root tells you where subdirectories are…
 
							- Assumes a LOT of RAM
 
						
					 
					Subfield:
 
					- 
						
							- 
								Main Memory Databases - keep journal on disk, don't care about it. The 
								actual data is on the RAM → making the system really fast
							
 
							- 
								Networked Main Memory Databases - get redundant hosts, thus have several
								different copies → improved performance
							
 
						
					 
					- 3. Journal Overflow (gets too big)
 
					- 
						
							- 
								A. Use circular bugger since we don't care about the first part
								if the journal is committed
							
 
							- B. Make it plenty long!
 
							- Stall writes if journal full [until journey empties]
 
						
					 
				
			
			Journals - 2 Kinds
				
				
					1. Write-Ahead Logs
 
					- 
						
							- BEGIN
 
							- Log all planned new values
 
							- Commit
 
							- Install new values into cell data
 
							- END
 
						
					 
					2. Write-Behind Logs
 
					- 
						
							- BEGIN
 
							- Log all old values
 
							- Install new values into cell data
 	
							- Commit
 	
							- END
 					
						
					 
					
 
					- 
						1. Recover after a crash, has roll-forward recovery (goes from the END
						backwards, looking for non-committed ones), then changes to forward direction
					
 
					- 
						2. Rollback recovery - scans backwards only, committing as it goes
					
 
				
						
				
			Performance…again
				
				
					- General Ideas (we want reads to go faster)
 
					- 
						
							- Application says: 
pread (fd, buf, size, location_in_file); 
						
					 
					
 
					- Assume RT-11 file system with contiguous files
 
					- 
						
							pread (fd, buf, size, L1); size = 4096 
							pread (fd, buf, size, L2); 							
						
					 
					
 
					- The Operating System can do this:
 
					- 
						
							
								
									
										lseek(disk, L1 + file_start); 
										read(disk->buf, fsize); 
										lseek(disk, L2 + file_start); 
										read(disk->buf, fsize); 
									 
								 | 
								
									
										- Average Seek Time:
 
										- +  8.9  ms
 
										- +  8.33 ms ~ rotational latency
 
										- +  0    ms read
 
										- = ~17.23ms per read!!
 
									 
								 | 
							
						
					 
					
 
					- Run-Time Optimization wins over Compile-Time Optimization in Operating Systems
 
					How To Make This Faster?
 
					- 
						
							- Choose buffer size equal to size of track to get whole track into RAM
 
							- 1. Rewrite all your applications to know about disk geometry [Not Portable!]
 
							- 
								2. SPECULATION - guessing what the system will do in the near future. Often done
								for reads… OS guesses where the next read will be and reads that data before
								being asked (while its in the neighborhood, in the background)
							
 
							- 
								2. BATCHING - at low level we ask for big things (even if the user [application] asks
								for little things) [minimizes transaction overhead at the lower level]
								
									- Works for writes (often done at many levels)
 
									- 
										Library call 
putchar(x) → write(1, &buf, 512) ←
										//Buffered in RAM, FAST (using 512 putchars)
									 
									fflush(stdout) //Want data in buffer to be written over 
									fsync(1) //Take everything associated with fd1 and make sure it actually hits disk, SLOW 
									- 
										
getchar() only calls one system call → read(0, buf, 8192)
										//Called 8192 times
									 
								
							 
							- 
								3. DALLYING - Puts the write off… 
getchar() → read(…)
								
									getchar() waits for the read() to finish 
									- 
										Can increase throughput of our system by scheduling something else…
										multitasking!
									
 
									- 
										Will constantly keep the CPU busy instead of wasting CPU cycles (you should
										always have one thing to do…
									
 
								
							 
						
					 
				
				
				
					Example
 
					- 
							1.29 GB/s  max internal transfer rate for a read (1287 MB/s)
							0.93 GB/s  max sustained transfer rate for a read (slows because yoyu have to seek from one track to the next)
							=3.0 GB/s  external transfer rate (speed at which disk controller can talk to bus)
					 
					
 
					- 1.2 million hours MTBP (mean time between failures)
 
					- 0.73% AFR (annualized failure rate
 
					- 1015 is the non-recoverable sector read failure rate