by Weichen Huang, Yuanzhi Gao, Pengcheng Pan, Yifang Liu
In "Understanding the Robustness of SSDs under Power Fault" by Zheng et al. (FAST '13), only 2 out of 15 SSDs kept working correctly when the authors cut the power. The other 13 suffered from failures such as bit corruption and flying writes.
Further explanation of flying writes : flash is not as reliable as a hard drive; a flash drive may wear out after being written roughly 10 billion times. To overcome this problem, flash controllers use wear-leveling: they keep moving data around so that no single physical block absorbs all the writes.
You think you are reading from block zero, but you are actually reading from wherever block zero happens to live today, because the blocks move around. So if you lose power while writing to flash, you may lose the table that maps virtual locations to physical ones, which leads to inconsistency.
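To make the remapping concrete, here is a minimal sketch of a flash translation layer, assuming one in-memory table from logical to physical blocks; all sizes and names are invented for illustration.

```c
#include <stdint.h>
#include <string.h>

#define NBLOCKS   16
#define BLOCKSIZE 512

static uint8_t  flash[NBLOCKS][BLOCKSIZE];   /* simulated flash cells     */
static uint32_t wear[NBLOCKS];               /* writes per physical block */
static uint32_t map[NBLOCKS];                /* logical -> physical       */

/* Wear-leveling: pick the least-worn physical block for the next write
   (a real controller would only consider free blocks). */
static uint32_t least_worn(void) {
    uint32_t best = 0;
    for (uint32_t i = 1; i < NBLOCKS; i++)
        if (wear[i] < wear[best]) best = i;
    return best;
}

/* A logical write lands on a fresh physical block, then the map entry is
   updated.  If power fails before the map itself reaches stable storage,
   later reads fetch data from the wrong place: a flying write. */
void ftl_write(uint32_t logical, const uint8_t *data) {
    uint32_t phys = least_worn();
    memcpy(flash[phys], data, BLOCKSIZE);
    wear[phys]++;
    map[logical] = phys;         /* the step that must survive power loss */
}

const uint8_t *ftl_read(uint32_t logical) {
    return flash[map[logical]];  /* "block zero" is wherever it lives today */
}
```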
Question : How should ^x^s (SAVE FILE) work on a raw device?
First Attempt : write the new contents directly over the old copy of the file on the device.
Problem : What if the power is unplugged in the middle of the write-back?
GOLDEN RULE OF ATOMICITY : never overwrite your only copy; i.e., write the new version into a different location.
Second Attempt : write the new version into a spare area of the device, leaving the original untouched.
Instead of overwriting the original data, we write into this spare area. If the system crashes in the middle, you can still refer to your old copy, which was never changed; once the write finishes, the new copy is on disk as well. In this way we obey the golden rule of atomicity.
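A minimal sketch of this second attempt, assuming the raw device has room for two complete copies of the file at fixed offsets (the layout is invented; pwrite is an ordinary POSIX call):

```c
#include <unistd.h>

#define AREA_SIZE 8192        /* invented size of each copy area */

static int current;           /* which area holds the live copy  */

void save(int dev, const void *data, long len) {
    int other = !current;                              /* never overwrite */
    pwrite(dev, data, len, (off_t)other * AREA_SIZE);  /* your only copy  */
    current = other;          /* but this variable lives in RAM and is
                                 lost on reboot... */
}
```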
Problem : How do you know which copy you are using after rebooting?
Timestamp Approach :
What we can do at the application level is put a timestamp in each copy of the file. At the operating-system level, we can put a timestamp into every block written to disk, which achieves the same reliability. After a reboot, whichever copy carries the later timestamp is the current one. The timestamp approach works, but it has its own problem: the timestamps waste space in every block.
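A sketch of the idea, assuming each copy of the file carries a version timestamp in a header and that the timestamp is written only after the copy's data is complete (names invented):

```c
#include <stdint.h>

struct copy_header {
    uint64_t timestamp;   /* bumped on every save, written after the data */
    /* ... file contents follow ... */
};

/* After reboot: the copy whose save finished last carries the larger
   timestamp, so it is the current one. */
int current_copy(const struct copy_header *a, const struct copy_header *b) {
    return (a->timestamp >= b->timestamp) ? 0 : 1;
}
```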
Master Control Block :
The MCB is a block that we keep separate from the file's two copies. We store just one bit in the master control block: if the bit is zero, the first copy is the current version of the file; if it is one, the second copy is.
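Here is a minimal sketch of saving through an MCB, simulating the raw device as two fixed 8 KB areas followed by a one-byte MCB; the layout is invented, while pread, pwrite, and fsync are ordinary POSIX calls:

```c
#include <stdint.h>
#include <unistd.h>

#define AREA_SIZE 8192                /* invented layout: area 0, area 1, */
#define OFF_MCB   (2 * AREA_SIZE)     /* then the one-byte MCB            */

void save_file(int dev, const void *data, long len) {
    uint8_t cur = 0;
    pread(dev, &cur, 1, OFF_MCB);               /* which area is current? */
    uint8_t next = !cur;
    pwrite(dev, data, len, (off_t)next * AREA_SIZE);  /* golden rule: the
                                                   new copy goes elsewhere */
    fsync(dev);                      /* the data must be durable before... */
    pwrite(dev, &next, 1, OFF_MCB);  /* ...the commit: flip the MCB bit    */
    fsync(dev);
}
```

The MCB write is the commit: a reboot before it sees the old version, a reboot after it sees the new one.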
But this approach does not survive all of the failure modes. It won't survive flying writes or bit corruption; in fact, it can't survive any of those problems by itself! The master control block approach is good enough, though, if we assume we are one of those 2 of 15 SSDs that actually work without those problems (bit corruption, flying writes).
Problem with using only one MCB :
Block sizes in a file system are large. Our problem is that a write at the file-system level can't be implemented atomically at the lower level, because the device can't write an 8 KB block in a single atomic step. Let's suppose writes can be partial at the lower level, and say the file system uses 8 KB blocks while the device uses 512-byte sectors. The way to model it: one high-level write of 8 KB turns into 16 low-level writes. We issue a batch of sector writes to the disk controller and wait to get 16 responses back.
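The decomposition can be sketched like this, with invented sizes and one ordinary POSIX pwrite per sector:

```c
#include <unistd.h>

#define FS_BLOCK 8192
#define SECTOR   512

void write_fs_block(int dev, off_t off, const char *buf) {
    for (int i = 0; i < FS_BLOCK / SECTOR; i++)        /* 16 sector writes */
        pwrite(dev, buf + i * SECTOR, SECTOR, off + i * SECTOR);
    /* If power fails between (or during) iterations, the block is left
       holding a mixture of old and new sectors. */
}
```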
When a block holds A and you issue a write so that it will hold B instead, the block has indeterminate content in between (some mixture of A and B whose details we don't care about). Our model is: if you lose power in the middle of a write, the contents are indeterminate. We need an algorithm that lets us reliably tell whether we stored a zero or a one despite this.
One idea : keep two copies of the MCB instead of just one.
Both copies start out with the same content, say A. To store a new value B, we first write B into MCB1; only when that write has completed do we start writing B into MCB2.
Notice that at every moment during execution at least one copy holds an intact value, either A or B, so reliable data is always stored somewhere.
Problem : after rebooting, if the two copies disagree, we can't tell which one is the garbage from the interrupted write and which one is correct.
Another idea : use three MCBs.
All three start out with the same value. To store a new value, we write it into MCB1, then MCB2, then MCB3, always in that order, waiting for each write to finish before starting the next.
After reboot, you look at all 3 :
- if MCB1 = MCB2, use their value; the write that was interrupted, if any, hit MCB3
- if MCB1 and MCB2 disagree, the interrupted write hit MCB1 or MCB2, and MCB3 still holds a good old value; use MCB3
- a later read can also detect the bad sector outright, since a torn sector fails its on-disk checksum
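A sketch of both halves of the algorithm, with the three copies simulated as an in-memory array (a real version would make each assignment a separate durable sector write):

```c
#include <stdint.h>

static uint8_t mcb[3];        /* stand-ins for three separate sectors */

/* Always write in the same order: MCB1, then MCB2, then MCB3, waiting
   for each write to become durable before starting the next. */
void mcb_store(uint8_t v) {
    mcb[0] = v;
    mcb[1] = v;
    mcb[2] = v;
}

/* Recovery: a crash can tear at most one copy.  If the first two agree,
   the torn copy (if any) is MCB3, so their value is safe to use; if they
   disagree, the crash hit MCB1 or MCB2, and MCB3 is intact and old. */
uint8_t mcb_recover(void) {
    return (mcb[0] == mcb[1]) ? mcb[0] : mcb[2];
}
```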
Procedure to complete this request (e.g., rename("d/a", "d/b")) :
1. read the directory's data block into memory
2. change the entry's name from a to b
3. write the block back to disk with a single low-level write
Because directory entries vary in length, the new name can be longer than the old one. The rename remains atomic as long as the updated entry still fits into the block, as in the sketch below.
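A sketch of the read-modify-write, where find_entry is a hypothetical helper that locates the variable-length entry inside the block:

```c
#include <string.h>
#include <unistd.h>

#define BLK 8192

extern char *find_entry(char *block, const char *name);   /* hypothetical */

void rename_in_block(int dev, off_t blkoff, const char *a, const char *b) {
    char block[BLK];
    pread(dev, block, BLK, blkoff);    /* 1. read the directory block     */
    char *e = find_entry(block, a);
    if (e != NULL)                     /* 2. edit in memory (assumes the  */
        memcpy(e, b, strlen(b) + 1);   /*    new name fits in the entry)  */
    pwrite(dev, block, BLK, blkoff);   /* 3. the commit: one block write  */
}
```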
Now suppose two hard links point to the same file while its link count is only 1. When you unlink("d/a"), the link count drops to 0, the file's storage can be reclaimed, and the surviving link dangles. So what is a safe procedure? Update the link count before touching the directory entries:
1. write the inode with the link count incremented
2. write the new directory entry
3. erase the old directory entry
4. write the inode with the link count decremented
If you lose power between any two steps, the link count can only be too high, never too low: the worst case is leaked storage, which fsck can reclaim. And fsck itself is idempotent, so if power is lost while it is running, you can simply run it again.
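As a sketch of the ordering, with each hypothetical helper standing for one write to stable storage:

```c
/* Hypothetical helpers: each call is one durable on-disk write. */
extern void write_inode_with_count(int count);
extern void write_new_dir_entry(void);
extern void erase_old_dir_entry(void);

void safe_relink(void) {
    write_inode_with_count(2);   /* 1. the count goes UP first            */
    write_new_dir_entry();       /* 2. two entries, count 2: consistent   */
    erase_old_dir_entry();       /* 3. one entry, count 2: too high, but
                                       that only leaks storage; fsck can
                                       repair it                          */
    write_inode_with_count(1);   /* 4. count back to the truth            */
}
```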
A commit is the single low-level write operation that matters: the one write that actually determines whether or not the file has been updated.