CS 111 Spring 2014
Lecture 10: File System Performance (With More on Deadlocks)
by Kevin Sheu and Zeeshan Pirzada
Table of Contents
- More on Deadlock
- Quasi-Deadlock
- Priority Inversion
- Receive Livelock
- Event Driven Programming
- Hardware Lock Elision (HLE)
- File System Performance Ideas
- Performance Metrics for File Systems
More on Deadlock
Deadlock has 4 conditions:
- Circular Wait - processes are waiting for resources held by other processes, which are waiting for other resources
- Mutual Exclusion - holding a resource excludes other processes from acquiring that resource
- No Preemption - a process cannot take a resource away from the process that currently holds it
- Hold & Wait - processes are holding one resource while waiting for another
Solution: In order to avoid deadlock, you need to design a system in which at least one of these conditions cannot hold.
- Circular wait is the condition most commonly attacked in system design.
To avoid deadlocking, we can look for cycles in the wait-for graph on each wait operation: if letting a process block on a resource would create a cycle, we refuse to let it acquire that resource.
The cost of checking for this type of deadlock with a loop over P processes and R resources is O(PR) per wait. This is a naive implementation, and we would obviously prefer a solution with a significantly lower cost, ideally O(1).
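As a concrete illustration, here is a minimal sketch of the cycle check in C, assuming a simplified model in which each process waits on at most one resource at a time; waits_for and would_deadlock are illustrative names, not part of any real API.

    #include <stdbool.h>

    #define NPROC 64
    extern int waits_for[NPROC];   /* waits_for[p] = process holding what p wants, or -1 */

    /* Before letting process p block on a resource owned by 'owner', walk the
       chain of waiters starting at the owner.  If the chain leads back to p,
       granting the wait would close a cycle, so the request must be refused. */
    bool would_deadlock(int p, int owner) {
        for (int q = owner; q != -1; q = waits_for[q])
            if (q == p)
                return true;
        return false;
    }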
Another solution is lock ordering:
- Processes would lock resources in increasing order (e.g. by machine address)
- Every resource would be placed in this ordering, and the ordering would be the same for every process.
- If a process finds that a lock it needs is busy, it gives up all of the locks it holds and starts over (see the sketch after this list)
- This eliminates the deadlock condition of hold and wait.
- O(1)!
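Here is a minimal sketch of lock ordering in C using POSIX mutexes; lock_both is an illustrative helper, and ordering by machine address stands in for whatever global ordering the system chooses.

    #include <pthread.h>
    #include <sched.h>

    /* Acquire two locks in increasing address order.  If the second one is
       busy, release the first and start over instead of holding and waiting. */
    void lock_both(pthread_mutex_t *a, pthread_mutex_t *b) {
        if (a > b) { pthread_mutex_t *t = a; a = b; b = t; }  /* order by address */
        for (;;) {
            pthread_mutex_lock(a);                 /* lower-address lock first */
            if (pthread_mutex_trylock(b) == 0)
                return;                            /* got both locks, in order */
            pthread_mutex_unlock(a);               /* b was busy: give everything up */
            sched_yield();                         /* let someone else make progress */
        }
    }

Because every thread acquires locks in the same order and gives everything up when it hits a busy lock, hold-and-wait (and hence circular wait) cannot arise.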
Quasi-Deadlock
Priority Inversion
- Problem: a high-priority thread blocks on a lock held by a low-priority thread; if a medium-priority thread keeps the CPU busy, the low-priority thread never runs and the high-priority thread waits indefinitely, even though nothing is technically deadlocked.
- Solution: when a high-priority thread waits for a resource, it lends its priority to whichever thread currently holds that resource.
- Example: T-low runs at high priority until it releases the lock, then drops back to its low priority (see the sketch below).
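A minimal sketch of priority lending on a POSIX system that supports priority-inheritance mutexes; this is one standard way to get the behavior described above, not necessarily the exact mechanism the lecture had in mind.

    #include <pthread.h>

    pthread_mutex_t lock;

    /* Create the lock with the priority-inheritance protocol: if T-high blocks
       on it while T-low holds it, T-low is temporarily boosted to T-high's
       priority until it unlocks. */
    void init_pi_lock(void) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }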
Receive Livelock
A graph of throughput versus incoming request rate shows the typical trends: ideally, throughput climbs and then levels off, but with receive livelock it climbs, peaks, and then collapses toward zero as the request rate keeps growing.
Here is the scenario that explains that awful collapsing curve:
Problem: The CPU is doing two things:
- The interrupt service routine grabs data and puts the data into the bounded buffer.
- The work routine takes data out of the buffer and does the actual computation.
However, when too many requests come in, the CPU becomes swamped: the interrupt service routine has higher priority than the work routine, so the buffer keeps filling but no useful work gets done.
Solution:
- Check how full the buffer is when the work routine runs.
- If it is above a high-water mark (e.g. ~90% full), disable and ignore the interrupt service routine.
- Once the buffer falls back to a low-water mark (e.g. ~50% full), accept interrupts again; this keeps the throughput curve from collapsing (a sketch follows).
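A minimal sketch of the high/low watermark idea; BUF_SIZE, buffered, and the enable/disable routines are hypothetical names for this example.

    #define BUF_SIZE 1024          /* capacity of the bounded buffer (assumed) */

    extern int buffered;           /* entries currently waiting in the buffer */
    extern void disable_rx_interrupts(void);   /* hypothetical device controls */
    extern void enable_rx_interrupts(void);
    extern void process_one_request(void);

    void work_routine(void) {
        if (buffered >= BUF_SIZE * 9 / 10)     /* above ~90%: stop taking interrupts */
            disable_rx_interrupts();

        process_one_request();                 /* do the actual work */
        buffered--;

        if (buffered <= BUF_SIZE / 2)          /* back near 50%: accept interrupts again */
            enable_rx_interrupts();
    }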
Event-Driven Programming
Data Races are Evil:
- Deadlocks occur because resources exist that need to be held and locked.
- The problems and solutions exist because of data race conditions
- Essentially, the underlying problem is that we have more than one thread running.
Alternative Solution: just use one thread
- However, we still want to be able to wait for events.
- This approach is called event driven programming.
Example:
    while (true) {
        wait_for_event(e);
        e->handler(e);
    }
- Key Idea: the handlers themselves cannot wait; the only part of the system that waits is wait_for_event(e).
- Fundamental Property: each handler must finish quickly, so that no event has to wait long behind the others. (A more concrete sketch appears after the lists below.)
Positive (+)
- No locks
- No deadlock
- No bugs (that arise from deadlocking)
Negative (-)
- Handlers are restricted (we may be forced to break handlers into pieces)
- Cannot scale via multithreading, so we are forced to scale via multiprocess or multisystem
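To make this concrete, here is a minimal sketch of a single-threaded event loop built on poll(); the file descriptors and handler names are made up for illustration.

    #include <poll.h>

    #define NFDS 2

    typedef void (*handler_t)(int fd);

    /* Handlers must run to completion without waiting. */
    static void handle_network(int fd) { /* read a request, queue a response */ }
    static void handle_disk(int fd)    { /* consume a completed disk I/O */ }

    void event_loop(int net_fd, int disk_fd) {
        struct pollfd fds[NFDS] = {
            { .fd = net_fd,  .events = POLLIN },
            { .fd = disk_fd, .events = POLLIN },
        };
        handler_t handlers[NFDS] = { handle_network, handle_disk };

        for (;;) {
            poll(fds, NFDS, -1);                 /* the only place we ever wait */
            for (int i = 0; i < NFDS; i++)
                if (fds[i].revents & POLLIN)
                    handlers[i](fds[i].fd);      /* dispatch; handlers never block */
        }
    }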
Hardware Lock Elision
- One major problem with spin locking is that too much CPU time is spent waiting for the locks to lock/unlock.
- Solution: Get hardware support that gives the effect of spin locks without actually serializing on the lock when there is no contention.
Example Code:
    lock:
        movl $1, %eax
    try:
        xacquire lock xchgl %eax, lockvar
        cmp $0, %eax
        jnz try
        ret
    unlock:
        xrelease movl $0, lockvar
        ret
- The xacquire prefix allows optimistic execution of the critical section by eliding the lock: the lock still appears free to other threads, so they can run the critical section concurrently.
- The xrelease prefix marks the end of the elided critical section. If another thread conflicted with us while the lock was elided, the hardware discards the speculative work and re-executes the critical section, this time actually acquiring the lock. (A C usage sketch follows.)
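A minimal sketch of how the routines above might be used from C; balance and deposit are illustrative, and lock, unlock, and lockvar refer to the assembly above.

    extern void lock(void);     /* the xacquire-based routine above */
    extern void unlock(void);   /* the xrelease-based routine above */

    long balance;               /* shared data that lockvar is meant to protect */

    void deposit(long amount) {
        lock();                 /* usually elided: threads can run this concurrently */
        balance += amount;      /* critical section */
        unlock();               /* on a conflict, hardware re-runs it with the real lock */
    }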
File System Performance Ideas
Below are some ideas that can be used to enhance a file system's performance:
Striping
Assign a piece of the file to each disk, so the pieces can be read in parallel
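A minimal sketch of the block-to-disk mapping, assuming fixed-size blocks laid out round-robin across the disks; NDISKS and the stripe function are illustrative.

    #define NDISKS 4            /* number of disks in the stripe set (assumed) */

    /* Map a file's logical block number to (disk, block within that disk).
       Consecutive logical blocks land on different disks, so up to NDISKS
       blocks can be read in parallel. */
    void stripe(long logical_block, int *disk, long *physical_block) {
        *disk           = (int)(logical_block % NDISKS);
        *physical_block = logical_block / NDISKS;
    }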
Distributed Metadata
- metadata = info about file location, ownership, timestamp, etc.
- Every node will contain metadata about the entire file system
- Each node effectively becomes a master node for the system
Efficient Directory Indexing
- If a directory contains millions of files (say, over 10,000,000), it needs an efficient internal data structure, such as a B+ tree, rather than a linear list of entries.
Partition Awareness
- Ideally, the file system is spread over the network (the cloud).
- Occasionally the network will partition (for example, a link fails), splitting the nodes into disconnected groups.
- Clients that can only reach the smaller side get read-only access.
- Clients on the larger side can both read and write.
- Basically, the largest partition stays live.
File System Staying Live During Maintenance
- Upgrading, adding files, removing files, and so on should not affect whether the system keeps functioning.
Performance Metrics For File Systems
Performance metrics:
- Latency - delay from request to response
- Throughput - rate of request completions
- Utilization - fraction of capacity doing useful work (e.g. CPU cycles, bus bandwidth)
Example: Assume a system with these characteristics:
- CPU cycle time = 1 nanosecond
- Programmed Input/Output (PIO) = 1000 cycles = 1 microsecond
- Sending a command to the SSD (solid-state drive) = 5 PIOs = 5 microseconds
- SSD Latency = 50 microseconds
- Computation per buffer = 5 microseconds
- Reading 40 bytes from SSD = 40 PIOs = 40 microseconds
- Blocking overhead = 50 microseconds
Basic busy-wait + polling approach for reading:
    for (;;) {
        char buf[40];
        read 40 bytes from SSD into buf;   // poll (busy-wait) until the device is ready
        compute(buf);
    }
Latency (microseconds):
- Sending command to SSD: 5
- SSD latency: 50
- Read latency: 40
- Computation: 5
- Total: 100
- Throughput = 1/latency = 1/(100 microseconds) = 10,000 requests/second
- Utilization = 5/100 = 5% (only the 5 microseconds of computation is useful work)
- This is okay, but we can definitely go faster! (A small sketch of this arithmetic follows.)
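As a sanity check, here is a small C program that reproduces the numbers above from the cost assumptions listed earlier.

    #include <stdio.h>

    int main(void) {
        /* cost assumptions from the list above, all in microseconds */
        double send_cmd = 5, ssd_latency = 50, read_40 = 40, compute = 5;

        double latency     = send_cmd + ssd_latency + read_40 + compute;  /* 100 us */
        double throughput  = 1e6 / latency;       /* requests per second */
        double utilization = compute / latency;   /* fraction of time doing useful work */

        printf("latency = %.0f us, throughput = %.0f req/s, utilization = %.1f%%\n",
               latency, throughput, 100 * utilization);
        return 0;
    }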
Below are different ideas for speedup:
Batching
Basic Idea: grabbing bigger blocks of data (bigger buffer)
    for (;;) {
        char buf[840];
        read 840 bytes from SSD into buf;
        for (i = 0; i < 21; i++)
            compute(buf + i*40);
    }
Latency (microseconds):
- Sending command to SSD: 5
- SSD latency: 50
- Read latency: 21 * 40 = 840
- Computation: 21 * 5 = 105
- Total: 1000
- Throughput = 21 requests / 1000 microseconds = 21,000 req/sec
- Utilization = 105/1000 = 10.5%
Interrupts
    for (;;) {
        send command to SSD;
        do {
            block until interrupt;    // give up the CPU while the SSD works
            handle interrupt;
        } while (!ready);
        read 40 bytes from SSD into buf;
        compute(buf);
    }
Latency (microseconds):
- Send to SSD: 5
- Blocking: 50
- Handling: 5
- Checking if ready: 1
- Read latency: 40
- Computation: 5
- Total: 106
- Throughput = 1/(56 microseconds) ≈ 17,857 req/sec (the 50 microseconds of blocking overlap with other work, so each request costs only 5 + 5 + 1 + 40 + 5 = 56 microseconds of CPU time)
- Utilization = 5/56 ≈ 8.9%
DMA (Direct Memory Access)
    while (1) {
        write command to disk;
        block until interrupt;
        check if disk is ready;
        compute;
    }
Latency (microseconds):
- Blocking: 50
- Handling: 5
- Checking if ready: 1
- Computation: 5
- Total: 61
- Throughput = 1/(11 microseconds) ≈ 91,000 req/sec (the blocking overlaps with other work, so each request costs 5 + 1 + 5 = 11 microseconds of CPU time)
- Utilization = 5/11 ≈ 45%
However, we can give up on interrupts entirely and just use a polling approach.
DMA + Polling
    while (1) {
        write command to disk;
        while (DMA slots are not ready)
            schedule();          // yield to another thread instead of blocking
        check if disk is ready;
        compute;
    }
Latency (microseconds):
- Blocking (scheduling away): 50
- Checking if ready: 1
- Computation: 5
- Total: 56
- Throughput = 1/(6 microseconds) ≈ 166,666 req/sec
- Utilization = 5/6 ≈ 84%
We can see that the utilization rate is extremely high for DMA with polling. Consequently, this is one of the more popular methods!