Jonathan Woong

CS 111 Lecture 14 (February 29, 2016)

----------The GZIP Bug----------

Using the gzip command to compress file(s) may result in data loss.

Example (previous lecture):

$ gzip foo will execute the following code:

1 inputfd = open("foo",O_RDONLY); 2 if (inputfd < 0) 3 { error; } 4 outputfd = open("foo.gz", O_WRONLY|O_CREAT|O_EXECL, 0600); 5 if(output(fd < 0)) 6 { error; } 7 read(inputfd, buf, bufsize); // + check for I/O errors 8 write(outputfd, buf, bufsize); // + check for I/O errors 9 if(close(inputfd) != 0) 10 { error; } 11 if(close(outputfd) != 0) 12 { error; } 13 if(unlink("foo") != 0) 14 { error; } // if the program crashes here, foo would be lost! 15 exit(0);

Solution (synchronous):

// error checking omitted 1 dfd = open(".", O_SEARCH); 2 fd = openat(dfd, "foo.gz", O_WRONLY|O_CREAT|O_EXECL, 0600); 3 fsync(fd); // save write (slow) 4 fdatasync(dfd); // save file creation (slow) 5 close(fd); 6 unlinkat(dfd, d"foo"); // unlink within contents of dfd

Use:

$ gzip --synchronous [arg]

Implementing the synchronous flag has two implications:

The default behavior of gzip without flags is unsafe
We'd have to change every other data-motifying system call to address the same issue (e.g. cp or mv)

----------Filesystem Reliability----------

Two methods: commit record and journal

1. Commit Record: a tentative method of ensuring write jobs are done in completion (assumes individual sectors can be written atomically)

writes to sectors	1	2	3	...	n (final write)	n+1 (commit record)

Writes to the sectors occur sequentially with "n" being the final write.
The "n+1" write is a commit record that indicates all of the write jobs have been completed.
If the write jobs cannot be completed (e.g. power out) before the commit record is written, the entire batch is discarded.
Reboot: If the system is rebooted after the commit record is written, the system will re-write from the beginning of the commit record to the end of the commit record.

2. Journal: memory is divded into a circular buffer (log) and an array of data blocks (cell memory)

log start	write 1	write 2	write 3	...	write n	commit	log end

cell data	data block 1	data block 2	data block 3	...	data block n

Journal Methods:

write ahead log (WAL)
write behind log (WBL)
log only
partitioning

I. WAL: a method that tracks intended changes to data (in the log) and executes them after commission (in cell memory)

Pre-commit: intended writes are appended to a log that tracks which data blocks to write to/from, the log has a start and end point
Post-commit: once the commit is reached, the system will execute the writes (to/from cell memory) specified in the log
Recovery/Reboot: the system must scan backwards (from log end) for the oldest completed log entry, then from that point begin re-executing until log end is reached

II. WBL: a method that reads old values (from the log) and reverts data (in cell memory) to the log specifications

Pre-commit: old log values are stored into the log
Post-commit: any discrepancies between the log and cell memory are resolved by updating the cell memory to match the log
Recovery/Reboot: the system does one scan backwards and updates cell data accordingly

Advantages of WBL

Faster recovery (only requires one scan backwards)
Fewer sectors to write compared to WAL

Disadvantages of WBL

Wastes resources reading old data from the disk (O.K. if you needed old data anyway)

III. Log-only: a method that treats completed log entries as completed writes, but requires periodic garbage collection when the log end wraps and meets the log start

Advantages of log-only

Fast for systems that require a large number of small writes

Main-memory database (variant): all data and log entries kept in main memory (RAM) for speed and persistency

Advantages of main-memory database

Fast and persistent data access

Disadvantages of main-memory database

All data is lost when power is lost

IV. Partitioning: separating data that must be persistent from data that is acceptable to lose

Example: Ext4 filesystem (journaling, fast read/writes, no fragmentation) vs. VFAT (windows compatibility)

How namei distinguishes between two filesystems: refer to an array of filesystem-inode pairs (dev_t filesystem, ino_t inodenumber) to ensure correct directory searching

How to Speed Up Journaling: store log (sequentially read) and cell memory on separate disks

----------Identifying/Fixing Unreliable Programs----------

How to prevent invalid memory references:

Hire perfect programmers
Code in bounds-checking languages (java/python)
Hardware bounds-checking

Hardware Method: use two kernel-level hardware registers as boundaries; every access to memory must fall between them

base (start of program)	accessible memory	bounds (end of program)

Context switching: requires switching the base and bounds registers

Advantages of hardware register boundaries

Quick operation for batch work

Disadvantages of hardware register boundaries

Can't grow program size dynamically
Fragmentation

Solution to fragmentation (x86-64): segmentation

Divide 32-bit memory addresses into an 8-bit segment number and 24-bit offset within segment

memory address:	8-bit segment number (0-255)	24-bit segment offset

Segment number-offset pairs are stored in a segment table

segment number	base	bounds
0	address	address
1	address	address
2	address	address
...	...	...
255	address	address

Two types of segment tables: stack (for stack data) and text (for code)

Advantages of segmenting addresses

Ability to grow/move stack by adjusting base and bounds registers

Disadvantages of segmenting addresses

Expensive to move programs around

----------Virtual Memory----------

Per-process paged based memory addressing that only requires a base address

Divide 32-bit virtual addresses into a 22-bit virtual page number (VPN) and 10-bit virtual page offset (VPO)

virtual address:	22-bit VPN	10-bit VPO

A 22-bit VPN is sent to VM hardware onboard that translates it into physical page, then the VPO is added to obtain a physical address

Uses for Virtual Memory:

I. Running 16 GB programs on machines with 8 GB of RAM

Slow if attempting to access memory in a random way (many page faults)
~20 million times faster if no disk access is required

II. Allowing programs to "safely" share memory

Example: sendmessage(destinationProcess, message) a 4096-byte aligned process
message: pointer to a buffer of size 4096

How the message is sent: obtain the address of the first message by looking at the sender's page table and dividing the message pointer by 4096, then copy that address to the receiving process's page table

How malloc works: malloc is implemented via mmap

mmap(fdToMmapFrom, offsetIntoFile, sizeOfRegion, VMaddr) allows a process to control its own page table

fdToMmapFrom can point to source /dev/zero to malloc zeroed-out space

Memory Layout

View RAM as cached swap space (split into pages) where page table entries contain pointers to pages in RAM and can be swapped out for pages in the RAM swap space

RAM:	swap page 1	swap page 2	swap page 3	...	swap page n	program data

page table
page 1
page 2
page 3
page 4
...
page n

On page fault: evict a victim page from page table and replace with a page from the swap space