Name: Genry Tovmasian
ID: 504306339
Synchronization (cont.), 2/10/2016: speedup methods

1. The main problem we have: one thread owns the lock while the other threads spin, calling yield() and then continue on every pass through the loop, paying for a context switch each time. What we really want is for the kernel/scheduler to never wake a blocked process at all, so none of that switching overhead is wasted.

How do we do that?

a. Use a blocking mutex (aka binary semaphore):

struct bmutex {
    lock_t lock;          // spinlock controlling access to this struct
    bool locked;          // true if some process holds the bmutex
    struct proc *waiter;  // linked list of processes blocked,
                          // waiting on the bmutex to be unlocked
};

2. We need a method to acquire the resource once it is finally available. (This assumes the spinlock primitives are implemented outside of this code; a caller may have been spinning before finally entering the acquire function.)

Implementation (only the final version is kept here):

void acquire(struct bmutex *b) {
    lock(&b->lock);
    if (!b->locked) {
        b->locked = true;
        unlock(&b->lock);
        return;
    }
    // Add ourselves to the front of the list of waiting processes.
    curproc->next = b->waiter;  // curproc: current process pointer, in a register
    b->waiter = curproc;
    // Mark the current process blocked.
    curproc->blocked = true;
    // Make sure to unlock before yielding, so the next process
    // that gets to run can access the bmutex's fields.
    unlock(&b->lock);
    yield();
}

3. Then we need to release the bmutex so that a waiting process can actually be scheduled again:

void release(struct bmutex *b) {
    // Need to wake up the waiting processes.
    // We'll wake just the first.
    b->locked = false;
    struct proc *w = b->waiter;
    if (w) {
        b->waiter = w->next;
        w->blocked = false;  // now the scheduler may run it
    }
}

But wait, race condition again! We touched b's fields without holding its spinlock:

void release(struct bmutex *b) {
    lock(&b->lock);
    // Wake just the first waiting process, as before.
    b->locked = false;
    struct proc *w = b->waiter;
    if (w) {
        b->waiter = w->next;
        w->blocked = false;
    }
    unlock(&b->lock);
}

4. What if you want more than one process to be able to access a file at the same time? The method above, with a boolean 'locked' field in the data structure, is not good enough: it is either on or off, locked or not, but perhaps several processes can safely access the resource simultaneously. To allow multiple holders, replace the boolean with a count:

    int num_avail;  // number of simultaneous holders allowed

which counts how many available slots (for processes to open) remain in the resource/file you are using.

4 (part 2). To make sure that works, you increment num_avail when you release, and decrement num_avail when you acquire.

5. Semaphores are an ancient way of implementing the above (if you don't want to use Eggert's code). Semaphores are essentially blocking mutexes, but not quite:

    P(sem)  acquire
    V(sem)  release
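Here is a minimal sketch of that counting idea, in the same pseudo-C as the bmutex code above. lock_t, lock(), unlock(), and yield() are the notes' primitives, not a real API, and this simple version spins on yield() instead of keeping a waiter list:

struct sema {
    lock_t lock;    // spinlock controlling access to this struct
    int num_avail;  // number of simultaneous holders still allowed
};

void P(struct sema *s) {            // acquire
    for (;;) {
        lock(&s->lock);
        if (s->num_avail > 0) {
            s->num_avail--;         // claim one slot
            unlock(&s->lock);
            return;
        }
        unlock(&s->lock);
        yield();                    // no slots left; let someone else run
    }
}

void V(struct sema *s) {            // release
    lock(&s->lock);
    s->num_avail++;                 // give the slot back
    unlock(&s->lock);
}

A real implementation would add a waiter list as in the bmutex above, so blocked processes are never scheduled while num_avail is zero.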
6. One of the problems could be with pipes. The pipe itself may be available (nobody holds its lock), yet still not be available to write to, because it is full. This is a problem: if 5 processes are trying to write to a full pipe, they will all open the pipe, look inside, write nothing because it is full, and still pass the whole blocking-mutex check (the file isn't locked, no one is using it, it is just full), context switch out (including all the overhead that switching involves), and do this over and over.

6.a. To fix this, we put a condition variable on top of the blocking mutex: an expression describing what we're waiting for. For the pipe example, the condition can be the expression

    p->w - p->r != 1024  // the write offset (p->w) is not a full buffer
                         // ahead of the read offset (p->r), so the pipe
                         // is not full [considering a pipe 1024 bytes big]

6.a (cont). There's another issue: this condition lives only in the calling process's mind. The OS/kernel does not know about this true/not-true condition, and it needs to know where the condition variable is. We can't use acquire() and release() alone anymore; we need a new function, wait(). For example, for this pipe problem:

void writec(struct pipe *p, char c) {
    // Once acquire returns, the scheduler has given us the pipe,
    // because its blocking mutex was unlocked.
    acquire(&p->b);
    while (p->w - p->r == 1024)  // there's our check :)
    {
        // wait() gives up the mutex and will not return until the
        // condition may hold again. It can wait a long time, and it
        // does not use the CPU while it waits. Use it when the retry
        // is heavyweight: you really want to avoid burning the CPU
        // on a big check and lots of operations if they are going
        // to fail anyway (e.g., the pipe is full). However, if the
        // blocking operation will end quickly, a few quick checks
        // like the one in the while loop above can cost less than
        // the overhead of running the hefty wait() we use here.
        wait(&p->nonfull, &p->b);
    }
    // This actually performs the write, because the non-full
    // condition has been met (i.e., wait has returned): store the
    // character 'c' and increment the write offset p->w.
    p->buf[p->w++ % 1024] = c;
    // notify() tells the OS to look at all the blocked readc()
    // callers (waiting on the nonempty condition, since a read
    // can't read from an empty pipe) and release them, so they
    // return from wait(), run their reads successfully, and gain
    // the mutex after this function releases it.
    notify(&p->nonempty);
    release(&p->b);
}
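The notes mention those blocked readc() callers; here is a sketch of that reader, my reconstruction rather than lecture code, assuming the same struct pipe fields and the wait()/notify() primitives used in writec:

char readc(struct pipe *p) {
    acquire(&p->b);
    while (p->w - p->r == 0)        // pipe empty: nothing to read yet
        wait(&p->nonempty, &p->b);  // give up the mutex until a writer notifies
    char c = p->buf[p->r++ % 1024]; // read one byte, advance the read offset
    notify(&p->nonfull);            // wake writers blocked on a full pipe
    release(&p->b);
    return c;
}

Each side waits on one condition (nonempty for readers, nonfull for writers) and notifies the other, so neither ever busy-polls a pipe it cannot use.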
7. There's another form of speedup, one that does not work like spinlocks (which can keep doing context switches and "eat up the CPU"). There is a hardware-level implementation of pseudo-spinlocking for very small sections of code that is even more efficient than a spinlock; spinlocks are inefficient, and they are a real problem when there is contention (lots of processes trying to access the same resource/file). One partial idea is to use two spinlocks that don't block each other, but the real fix is below.

******************************
HARDWARE LOCK ELISION (HLE)
******************************

We'd like to go faster, especially with our spinlocks. Our problem is that spinlocks inherently make everyone else wait, and `xchgl` slows things down. A new technique on Haswell and beyond is called **Hardware Lock Elision**. It caches the operations and waits to see whether anyone else ends up messing with that piece of memory: the updates are held in the cache and aren't actually 'DONE' until they are allowed, and if anyone else touches the part of memory the CPU modified, the cached actions are thrown away. That gives atomic behavior: either all the actions take effect, or none of them do because they would step on another process's toes.

We still do locking, but cheat to get performance. We execute a variant of xchgl and give it a prefix. An x86 example of an HLE lock:

        // Value to load into the mutex
        movl $1, %eax
try:
        // Swap the eax value with the mutex, eliding the lock
        xacquire lock xchgl %eax, mutex
        // eax should now hold zero
        cmpl $0, %eax
        // Try again if that failed; jump if not zero
        jnz try
        ret
unlock:
        xrelease movl $0, mutex
        ret

Everything done up until the xrelease is done provisionally in cache; nothing is stored to memory until then. At the xrelease point the CPU can check whether someone else acquired the lock. If they did, it jumps back to the xacquire and causes it to fail; it is as if the xrelease never happened. We're relying on the cache having enough room to track everything, so this only works for a limited amount of work.

Prof. Eggert measured HLE on his own code. This isn't a standard speedup you should expect; it is just what he got for his program, which was multithreaded to begin with:

    multithreaded version:   6x performance INCREASE, since the provisional
                             work stayed in cache and the CPUs were not kept
                             busy waiting
    single-threaded version: 2x DECREASE, since HLE caused a large amount of
                             work to be cached and then thrown away

8.
******************************************************
DEADLOCK (DUN DUN DUN)
******************************************************

Now consider:
    Thread 1: tries to transfer from account 19 to account 27
    Thread 2: tries to transfer from account 27 to account 19

Thread 1 locks 19, thread 2 locks 27, and each then tries to lock the other account. They can't. Boom: deadlock.

// Code to exemplify this horrible situation.
// What does this code do? It acquires one blocking mutex, then
// another. Before releasing both, it copies from one pipe's buffer
// into the other, but only if the source is non-empty and the
// destination is not full.
void copyc(struct pipe *p, struct pipe *q) {
    acquire(&p->b);
    acquire(&q->b);
    if ((p->w - p->r != 0) && (q->w - q->r != 1024))
        q->buf[q->w++ % 1024] = p->buf[p->r++ % 1024];
    release(&p->b);
    release(&q->b);
}

If one thread runs copyc(p, q) while another runs copyc(q, p), we get exactly the bank-account deadlock above. This can't be prevented with fancier locks; it's fundamental to how lock operations work. Hardware locks, conditional flags, etc. cannot solve it either: we'll just spin faster or self-block. Deadlock is a global, non-obvious property that is not apparent from looking at any one part of a system. By trying to solve one race condition, we introduced so many locks that we got another race condition: deadlock.

Can we always prevent deadlock, or do we accept the possibility of deadlock?

#### Conditions for Deadlock

##### Mutual exclusion
If one thread has acquired a resource, no one else can access it.

##### Circular wait
A cycle in the graph of processes waiting for each other to release something. It could be a long, convoluted cycle. It could even be a process waiting on itself.

##### No pre-emption of locks
Many OSes offer the ability to override locks: we force the other holder off and grab the lock for ourselves. If pre-emption is offered, there is no deadlock.

##### Hold and wait
One resource is held while waiting for another resource to become available. Without this ability, deadlock is impossible. For example, consider a system that only allows one lock to be held at a time.
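To make the four conditions concrete, here is a minimal pthreads sketch (mine, not from lecture) that satisfies all of them and reliably deadlocks: two threads take the same two mutexes in opposite order, just like copyc(p, q) racing copyc(q, p):

#include <pthread.h>
#include <unistd.h>

pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

void *t1(void *arg) {
    pthread_mutex_lock(&a);   // hold a ...
    sleep(1);                 // make the bad interleaving likely
    pthread_mutex_lock(&b);   // ... while waiting for b: hold-and-wait
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
    return NULL;
}

void *t2(void *arg) {
    pthread_mutex_lock(&b);   // hold b ...
    sleep(1);
    pthread_mutex_lock(&a);   // ... while waiting for a: circular wait
    pthread_mutex_unlock(&a);
    pthread_mutex_unlock(&b);
    return NULL;
}

int main(void) {
    pthread_t x, y;           // compile with: cc -pthread
    pthread_create(&x, NULL, t1, NULL);
    pthread_create(&y, NULL, t2, NULL);
    pthread_join(x, NULL);    // never returns: each thread holds one
    pthread_join(y, NULL);    // mutex and waits forever for the other
}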
ALL FOUR CONDITIONS ARE NECESSARY, SO PREVENTING ANY ONE OF THEM PREVENTS DEADLOCK.

#### Solving Deadlock

##### Detecting deadlocks dynamically
Here, the OS tracks every process and the resources it holds or is trying to hold. It builds a directed graph and makes sure the graph stays acyclic. If a new arc would establish a cycle, the lock request is rejected, giving an error status like EDEADLOCK for lock(). The kernel is helping applications, but apps must now be modified to watch out for deadlock.

##### Redesign so that deadlocks cannot occur
One way to do it: **lock ordering**. If a primitive needs more than one lock, we impose this lock-ordering algorithm (see the try_and_acquire sketch below, after the priority-inversion section):

1. We assume that locks are ordered and numbered. On Linux, we can just compare memory addresses:

    lock_t a, b;
    &a < &b;  // a comes first in the agreed order, so lock a before b

2. We acquire locks in order. If all acquisitions succeed, we're fine. But if a lock is not available, instead of waiting on that lock, we release ALL of our locks and start over. This assumes that blocking mutexes let us try to acquire without waiting, via try_and_acquire(): it never blocks, it just immediately fails. The only processes that wait are the ones waiting for their first resource. No one holds and waits, so the fourth condition, hold-and-wait, is broken.

There's still a problem: if processes take locks out of order, you can still get deadlock. Everyone has to agree on the order in which locks are acquired, and if that order is not preserved, we have a huge problem, because deadlock can occur at random.

9.
### Priority Inversion

Even apparently good programmers suffer from this: the Mars Pathfinder mission famously did. In detail, take three threads and one blocking mutex b:

    T_low (low priority)       T_med (medium priority)   T_high (highest priority)
    runnable                   waiting                   waiting
    acquire(&b); succeeds
                                                         becomes runnable
    context switch to T_high
                                                         acquire(&b); blocks,
                                                         since T_low holds b
                               becomes runnable
    context switch to T_med,
    which outranks T_low
                               runs a long time

T_low never gets the CPU back to finish its critical section and release its lock, because no one cares about letting it run: every context switch bounces between T_med and T_high.

T_high keeps the antenna pointed at Earth. T_med does something like check the battery. T_low does some background work. Suppose we begin with T_low runnable, T_med waiting, T_high waiting. T_low is running, so it grabs a lock. But then we get a context switch to T_high, which has just become runnable. It tries to grab the same object, but it can't, because T_low has it. T_high is blocked, and yields. We should get a context switch back to T_low, but suppose T_med is now runnable (woken up by some other previously blocking process). The scheduler chooses T_med over T_low, and so T_low never gets to run, which means T_high never gets to run.

To solve?
1. Add pre-emption: T_high can force T_low off the lock. But we risk data corruption with this.
2. Temporarily elevate the priority of a process that holds a lock on a resource desired by a high-priority process: the holder borrows the priority of the waiter.
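Fix 2 is known as priority inheritance, and POSIX exposes it directly on mutexes. A minimal sketch, assuming a platform that supports PTHREAD_PRIO_INHERIT:

#include <pthread.h>

pthread_mutex_t m;

void init_pi_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // A thread holding m temporarily runs at the priority of the
    // highest-priority thread blocked on m, so T_low cannot be
    // starved by T_med while T_high is waiting for the lock.
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&m, &attr);
}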
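And returning to the lock-ordering redesign above, here is a minimal sketch of it over POSIX mutexes (my code, not Eggert's), where pthread_mutex_trylock plays the role of try_and_acquire() and addresses impose the agreed order:

#include <pthread.h>
#include <sched.h>

// Acquire two mutexes without hold-and-wait: order them by address,
// and if the second is busy, give everything back and start over.
void acquire_both(pthread_mutex_t *a, pthread_mutex_t *b) {
    if (a > b) {                        // impose the agreed order
        pthread_mutex_t *tmp = a;
        a = b;
        b = tmp;
    }
    for (;;) {
        pthread_mutex_lock(a);          // wait only for our first lock
        if (pthread_mutex_trylock(b) == 0)
            return;                     // got both; caller unlocks later
        pthread_mutex_unlock(a);        // b is busy: release all, retry
        sched_yield();                  // let the current holder finish
    }
}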
10.
### Livelock

In livelock, the scheduling overhead itself becomes too much and hits the CPU limit, because every incoming packet causes an interrupt:

    interrupt handler:
    {
        take bytes;
        put them in the queue;   // new_packet()
    }

Suppose a bounded buffer receives requests at up to 100 Gb/s, but we can only service them at 10 Gb/s. Our goal graph:

    services done
      |        ________  flat at 10 Gb/s, our service limit
      |       /
      |      /
      |_____/____________ input load

Service really has more than one part:
1. The actual computation.
2. The interrupt service handler for packet handling, which chews up more and more time as more packets flood in.

The buffer may fill up, but we'll keep spending time tossing and accepting new packets. We'll always be doing something, handling a packet or computing (a little bit), but performance sucks because we are handling/queueing new packets instead of performing the computation the packets are asking for. The actual curve rises, peaks, and collapses:

    services done
      |     .-----.
      |    /       \
      |   /         \
      |__/___________\___ input load

Simple solution: mask out interrupts (dropping all new packets) once the buffer is full, perform the requests that are already queued up, and re-enable interrupts afterward.

11.
### Event-driven programming

**Locking code is a bug magnet (and too hard as well).**

We'd love to eliminate locks altogether, while still preventing races and attaining performance. The competing approach, used especially in embedded systems: event-driven programming.

First thing we do: toss multithreading. There is only one thread now, yet we still want to deal with asynchronous events. Fundamentally, threads go hand in hand with partitioning the work; here, we instead break the computation into a series of events, and our program becomes a set of event handlers. At runtime, we keep a queue of incoming events: the computation pops the first event off the queue, runs its handler, and then loops back to the queue. No processes bother each other, so none of them needs to lock, acquire(), or communicate between processes.
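A toy sketch of such an event loop; the structures and names here are illustrative, not a real framework:

#include <stddef.h>

struct event {
    void (*handler)(void *);   // the code to run for this event
    void *arg;
    struct event *next;
};

struct event *queue_head;      // incoming events, oldest first

// The whole program: pop the next event and run its handler to
// completion. Only one handler ever runs at a time, so handlers
// can share data freely with no locks and no races.
void event_loop(void) {
    while (queue_head) {
        struct event *e = queue_head;
        queue_head = e->next;
        e->handler(e->arg);
    }
}

The implicit contract is that every handler finishes quickly: a slow handler stalls every event queued behind it, which is the price paid for giving up threads and locks.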