Health Monitoring and Recovery
Introduction
Suppose that a system seems to be wedged ... it has not recently
made any progress.  How would we determine whether the system is deadlocked?
We would have to:
    -  identify all of the blocked processes.
 
    -  identify the resource on which each process is blocked.
 
    -  identify the owner of each blocking resource.
 
    -  determine whether or not the implied wait-for graph contains any cycles.
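 
If we could actually gather all of that information, the last step is
straightforward cycle detection over the implied wait-for graph.  The
sketch below assumes we already know, for each blocked process, which
resource it is waiting for and which process currently owns each
resource; the process and resource names are purely illustrative.

    # Build a process-to-process wait-for graph and look for a cycle.
    def build_wait_for_graph(blocked_on, owner):
        """blocked_on: process -> resource it is waiting for
           owner:      resource -> process that currently holds it"""
        graph = {}
        for process, resource in blocked_on.items():
            if resource in owner:                  # the owner may be unknown
                graph[process] = owner[resource]
        return graph

    def find_cycle(graph):
        """Follow wait-for edges from each process; revisiting a process
           already on the current path means a circular dependency."""
        for start in graph:
            path, current = [], start
            while current in graph:
                if current in path:
                    return path[path.index(current):]   # the deadlocked cycle
                path.append(current)
                current = graph[current]
        return None

    # Example: A waits for a lock held by B, and B waits for one held by A.
    blocked_on = {"A": "lock1", "B": "lock2"}
    owner = {"lock1": "B", "lock2": "A"}
    print(find_cycle(build_wait_for_graph(blocked_on, owner)))   # ['A', 'B']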
 
How would we determine that the system might be wedged, so that
we could invoke deadlock analysis in the first place?  And even then,
it may not be possible to identify the owners of all
of the involved resources, or even all of the resources themselves.
Worse still, a process may not actually be blocked, but merely waiting
for a message or event (that has, for some reason, not yet been sent).
And if we did determine that a deadlock existed, what would we do?  
Kill a random process?  This might break the circular dependency, but
would the system continue to function properly after such an action?
Formal deadlock detection in real systems ...
    -  is difficult to perform
 
    -  is inadequate to diagnose most hangs
 
    -  does not tell us how to fix the problem
 
Fortunately there is a simpler technique that is far more
effective at detecting, diagnosing, and repairing a much
wider range of problems: health monitoring and managed
recovery.
Health Monitoring
We said that we could invoke deadlock detection whenever we thought
that the system might not be making progress.  How could we know 
whether or not the system was making progress?  There are many
ways to do this:
    -  by having an internal monitoring agent watch
         message traffic or a transaction log to determine whether or
         not work is continuing.
 
    -  by having the service send periodic heart-beat
         messages to a health monitoring service (a simple
         sender is sketched in the code below).
 
    -  by having an external health monitoring service
         send periodic test requests to the service that
         is being monitored, and ascertain that they are
         being responded to correctly and in a timely fashion.
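 
As a concrete illustration of the heart-beat approach, here is a minimal
sketch of a sender that the monitored service might run in a background
thread.  The monitor address, port, and message format are assumptions
invented for this example, not a real protocol.

    # Minimal heart-beat sender; run in a background thread of the service.
    # The monitor address and the JSON message format are assumed.
    import json, socket, time

    MONITOR_ADDR = ("monitor.example.com", 9999)   # hypothetical monitor
    SERVICE_NAME = "order-service"                 # hypothetical service name
    HEARTBEAT_INTERVAL = 5                         # seconds between beats

    def heartbeat_loop():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        seq = 0
        while True:
            msg = json.dumps({"service": SERVICE_NAME, "seq": seq,
                              "sent_at": time.time()})
            sock.sendto(msg.encode(), MONITOR_ADDR)   # fire-and-forget UDP
            seq += 1
            time.sleep(HEARTBEAT_INTERVAL)

Note that such a message shows only that the sending process is alive,
which is precisely the limitation discussed below.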
 
Any of these techniques could alert us to a potential deadlock,
livelock, loop, or a wide range of other failures.  But each
of these techniques has different strengths and weaknesses:
    -  heart-beat messages can only tell us that the node and
         application are still up and running.  They cannot
         tell us whether the application is actually serving
         requests.
 
    -  an external health monitoring service can determine
         whether or not the monitored application is responding
         to requests.  But even if test requests are being served,
         other requests may still be deadlocked or otherwise wedged.
 
    -  an internal monitoring agent might be able to monitor
         logs or statistics to determine that the service is
	 processing requests at a reasonable rate (and perhaps
	 even that no requests have been waiting too long).
	 But if the internal monitoring agent fails, it may
	 not be able to detect and report errors.
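 
To make the external-probe idea concrete, the sketch below issues a test
request and treats a slow, incorrect, or missing response as a failure.
The health URL and expected response body are assumptions; a real prober
would issue service-specific test transactions.

    # External health probe: require a correct answer within a deadline.
    # The URL and expected body are hypothetical.
    import urllib.request

    HEALTH_URL = "http://service.example.com/health"   # hypothetical endpoint
    TIMEOUT_SECONDS = 2

    def probe_once():
        try:
            with urllib.request.urlopen(HEALTH_URL,
                                        timeout=TIMEOUT_SECONDS) as resp:
                return (resp.status == 200 and
                        resp.read().decode().strip() == "OK")
        except OSError:    # refused, timed out, unreachable, error status, ...
            return False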
 
Many systems use a combination of these methods:
    -  the first line of defense is an internal monitoring
         agent that closely watches key applications to 
	 detect failures and hangs.
 
    -  if the internal monitoring agent is responsible for
         sending heart-beats (or health status reports) to
	 a central monitoring agent, a failure of the internal
	 monitoring agent will be noticed by the central monitoring
	 agent.
 
    -  an external test service that periodically generates
         test transactions provides an independent assessment
	 that might include external factors (e.g. switches, load
	 balancers, network connectivity) that would not be tested
	 by the internal and central monitoring services.
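 
Here is a hedged sketch of how the first two layers might fit together:
an internal agent that checks local progress and forwards its findings to
a central monitor.  The get_completed_count() and report_to_central()
callables are placeholders for whatever log/statistics interface and
reporting channel a real system would provide.

    # Internal monitoring agent: check local progress, report it centrally.
    # get_completed_count() and report_to_central() are assumed placeholders.
    import time

    CHECK_INTERVAL = 10          # seconds between local progress checks

    def monitoring_agent(get_completed_count, report_to_central):
        last_count = get_completed_count()
        while True:
            time.sleep(CHECK_INTERVAL)
            count = get_completed_count()
            healthy = count > last_count        # did any requests complete?
            last_count = count
            # The report doubles as the agent's own heart-beat: if the agent
            # dies, the central monitor notices that reports have stopped.
            report_to_central({"healthy": healthy, "completed": count,
                               "reported_at": time.time()})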
 
Managed Recovery
Suppose that some or all of these monitoring mechanisms determine
that a service has hung or failed.  What can we do about it?
Highly available services must be designed for restart,
recovery, and fail-over:
    -  The software should be designed so that any process
         in the system can be killed and restarted at any time.
         When a process restarts, it should be able to reestablish
         communication with the other processes and resume working
         with minimal disruption.
 
    -  The software should be designed to support multiple
         levels of restart (see the restart sketch after this
         list).  Examples might be:
 
            -  warm-start ... restore the last saved
                 state (from a database or from information
                 obtained from other processes) and resume
                 service where we left off.
 
            -  cold-start ... ignore any saved state
                 (which may be corrupted) and restart
                 new operations from scratch.
 
    -  The software might also be designed for progressively
         escalating scope of restarts:
 
            -  restart only a single process, and expect
                 it to resync with the other processes when
                 it comes back up.
 
            -  restart all of the software on a single node.
 
            -  restart a group of nodes, or the entire system.
 
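A minimal sketch of the multi-level restart idea above: try a warm start
from saved state, and fall back to a cold start if that state is missing
or unusable.  The load_saved_state() and start_fresh() helpers are
hypothetical.

    # Two-level restart: warm start if possible, cold start otherwise.
    # load_saved_state() and start_fresh() are hypothetical helpers.
    def restart(load_saved_state, start_fresh):
        try:
            state = load_saved_state()    # warm start: resume where we left off
            if state is not None:
                return state
        except Exception:                 # saved state missing or corrupted
            pass
        return start_fresh()              # cold start: ignore any saved state
 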
Designing software in this way gives us the opportunity to begin
with minimal disruption, restarting only the process that seems
to have failed.  In most cases this will solve the problem, but
perhaps:
   - 	process A failed as a result of an incorrect request
        received from process B.
 
   -  the operation that caused process A to fail is still
        listed in the database, and when process A restarts,
        it may attempt to re-try the same operation and fail
        again.
 
   -  the operation that caused process A to fail may have
        been mirrored to other systems, which will also
        experience or cause additional failures.
 
For all of these reasons it is desirable to have the ability to
escalate to progressively more complete restarts of a progressively
wider range of components.  
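 
One way to structure such an escalation (a sketch, assuming hypothetical
restart_* callables and an is_healthy() check standing in for real restart
machinery): attempt the least disruptive restart first and widen the scope
only if health checks still fail.

    # Escalating recovery: widen the restart scope only if the problem persists.
    # The restart_* callables and is_healthy() are assumed placeholders.
    import time

    def escalating_recovery(restart_process, restart_node, restart_system,
                            is_healthy, settle_time=30):
        for action in (restart_process, restart_node, restart_system):
            action()
            time.sleep(settle_time)    # give the restart a chance to settle
            if is_healthy():
                return True            # recovered at this level
        return False                   # escalate to human operators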
False Reports
Ideally a problem will be found by the internal monitoring agent on the
affected node, which will automatically trigger a restart of the
affected software on that node.  Such prompt local action has the
potential to fix the problem before other nodes even notice that
there was a problem.
But suppose a central monitoring service notes that it has not received
a heart-beat from process A.  What might this mean?
   - It might mean that the node has failed.
 
   - It might mean that the process has failed.
 
   - It might mean that the system is loaded and the heart-beat message was delayed.
 
   - It might mean that a network error prevented or delayed the delivery 
       of a heart-beat message.
 
Declaring a process to have failed can potentially be a very expensive operation.
It might cause the cancellation and retransmission of all requests that had been
sent to the failed process or node.  It might cause other servers to start trying
to recover work-in-progress from the failed process or node.  And this recovery
might involve a great deal of network traffic and system activity.  We don't
want to start an expensive fire-drill unless we are pretty certain that a process
has failed.
   -  the best option would be for a failing process to detect its own
        problem, inform its partners, and shut down cleanly.
 
   -  if the failure is detected by a missing heart-beat, it may be 
        wise to wait until multiple heart-beat messages have been missed
	before declaring the process to have failed.
 
   -  in some cases, we might want to wait for multiple other 
        processes/nodes to complain.
 
But there is a trade-off here.  If we do not take the time to confirm 
suspected failures, we may suffer unnecessary service disruptions from
forcing fail-overs from healthy servers.  On the other hand, if we wait
too long before initiating fail-overs, we are prolonging the service
outage.  These so-called "mark-out thresholds" often require a great
deal of tuning.
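 
As a hedged sketch of such a mark-out rule, the code below declares a
process failed only after several consecutive heart-beats have gone
missing.  The interval and threshold values are illustrative and, as
noted above, would need careful tuning.

    # Mark-out check: only declare failure after several missed heart-beats.
    # The interval and threshold are illustrative values that need tuning.
    import time

    HEARTBEAT_INTERVAL = 5      # seconds between expected heart-beats
    MISSED_THRESHOLD = 3        # consecutive misses before marking out

    def should_mark_out(last_heartbeat_time, now=None):
        now = time.time() if now is None else now
        missed = (now - last_heartbeat_time) / HEARTBEAT_INTERVAL
        return missed >= MISSED_THRESHOLD

    # Example: last heart-beat arrived 20 seconds ago -> 4 intervals missed.
    print(should_mark_out(time.time() - 20))    # True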
Other Managed Restarts
As we consider failure and restart, there are two other interesting cases to note: