Health Monitoring and Recovery
Introduction
Suppose that a system seems to have locked up ... it has not recently
made any progress. How would we determine if the system was deadlocked?
- identify all of the blocked processes.
- identify the resource on which each process is blocked.
- identify the owner of each blocking resource.
- determine whether or not the implied dependency graph contains any loops.
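If we could actually collect all of that information, the last step
would reduce to a cycle search in a wait-for graph. The following is
a minimal sketch of that check, assuming (unrealistically, as discussed
below) that we can name every blocked process and the process it is
waiting on:

    # Idealized wait-for graph: map each blocked process to the process that
    # owns the resource it is waiting for.  A deadlock is a cycle in this graph.
    def find_deadlock(waits_for):
        """Return a list of processes forming a cycle, or None if there is none."""
        for start in waits_for:
            path = []
            current = start
            while current in waits_for:
                if current in path:                  # revisited a node: cycle found
                    return path[path.index(current):]
                path.append(current)
                current = waits_for[current]
        return None

    # Hypothetical example: A waits on B, B waits on C, C waits on A.
    print(find_deadlock({"A": "B", "B": "C", "C": "A"}))    # ['A', 'B', 'C']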
How would we determine that the system might be wedged, so that
we could invoke deadlock analysis?
It may not be possible to identify the owners of all
of the involved resources, or even all of the resources.
Worse still, a process may not actually be blocked, but merely waiting
for a message or event (that has, for some reason, not yet been sent).
If we did determine that a deadlock existed, what would we do?
Kill a random process? This might break the circular dependency, but
would the system continue to function properly after such an action?
Formal deadlock detection in real systems ...
- is difficult to perform
- is inadequate to diagnose most hangs
- does not enable us to fix the problem
Fortunately there are better techniques that are far more
effective at detecting, diagnosing, and repairing a much
wider range of problems: health monitoring and managed
recovery.
Health Monitoring
We said that we could invoke deadlock detection whenever we thought
that the system might not be making progress. How could we know
whether or not the system was making progress? There are many
ways to do this:
- by having an internal monitoring agent watch
message traffic or a transaction log to determine whether or
not work is continuing
- by asking clients to submit failure reports to
a central monitoring service when a server appears
to have become unresponsive
- by having each server send periodic heart-beat
messages to a central health monitoring service.
- by having an external health monitoring service
send periodic test requests to the service that
is being monitored, and ascertain that they are
being responded to correctly and in a timely fashion.
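As an illustration of the heart-beat approach, the sketch below shows a
server-side agent that periodically reports its status to a central
monitoring service. The address, interval, and message format are
assumptions made for the example, not part of any particular system:

    import json
    import socket
    import time

    MONITOR_ADDR = ("monitor.example.com", 9999)    # hypothetical central monitor
    INTERVAL = 5                                    # seconds between heart-beats

    def send_heartbeats(service_name):
        """Send a small UDP heart-beat message to the central monitor, forever."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            beat = {"service": service_name, "time": time.time(), "status": "OK"}
            sock.sendto(json.dumps(beat).encode(), MONITOR_ADDR)
            time.sleep(INTERVAL)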
Any of these techniques could alert us to a potential deadlock,
livelock, loop, or a wide range of other failures. But each
of these techniques has different strengths and weaknesses:
- heart-beat messages can only tell us that the node and
application are still up and running. They cannot
tell us if the application is actually serving
requests.
- clients or an external health monitoring service can determine
whether or not the monitored application is responding
to requests. But the fact that test requests are answered
does not mean that other requests have not become
deadlocked or otherwise wedged.
- an internal monitoring agent might be able to monitor
logs or statistics to determine that the service is
processing requests at a reasonable rate (and perhaps
even that no requests have been waiting too long).
But if the internal monitoring agent fails, it may
not be able to detect and report errors.
Many systems use a combination of these methods:
- the first line of defense is an internal monitoring
agent that closely watches key applications to
detect failures and hangs.
- if the internal monitoring agent is responsible for
sending heart-beats (or health status reports) to
a central monitoring agent, a failure of the internal
monitoring agent will be noticed by the central monitoring
agent.
- an external test service that periodically generates
test transactions provides an independent assessment
that might include external factors (e.g. switches, load
balancers, network connectivity) that would not be tested
by the internal and central monitoring services.
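An external test service of the kind described in the last item might
be as simple as the sketch below: it periodically issues a test request
and checks that the answer is both correct and prompt. The URL,
expected reply, and timeout are hypothetical:

    import time
    import urllib.request

    TEST_URL = "http://service.example.com/health"   # hypothetical test endpoint
    TIMEOUT = 2.0                                    # seconds before a reply is "late"

    def probe_once():
        """Issue one test request; True only if answered correctly and promptly."""
        start = time.time()
        try:
            with urllib.request.urlopen(TEST_URL, timeout=TIMEOUT) as reply:
                ok = (reply.status == 200 and reply.read() == b"OK")
        except OSError:                              # connection refused, timeout, etc.
            ok = False
        return ok and (time.time() - start) < TIMEOUT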
Managed Recovery
Suppose that some or all of these monitoring mechanisms determine
that a service has hung or failed. What can we do about it?
Highly available services must be designed for restart,
recovery, and fail-over:
- The software should be designed so that any process
in the system can be killed and restarted at any time.
When a process restarts, it should be able to reestablish
communication with the other processes and resume working
with minimal disruption.
- The software should be designed to support multiple
levels of restart. Examples might be:
- warm-start ... restore the last saved
state (from a database or from information
obtained from other processes) and resume
service where we left off.
- cold-start ... ignore any saved state
(which may be corrupted) and restart
new operations from scratch.
- reset and reboot ... reboot the
entire system and then cold-start
all of the applications.
- The software might also be designed for a progressively
escalating scope of restarts:
- restart only a single process, and expect
it to resync with the other processes when
it comes back up.
- maintain a list of all of the processes
involved in the delivery of a service, and
restart all processes in that group.
- restart all of the software on a single node.
- restart a group of nodes, or the entire system.
Designing software in this way gives us the opportunity to begin
with minimal disruption, restarting only the process that seems
to have failed. In most cases this will solve the problem, but
perhaps:
- process A failed as a result of an incorrect request
received from process B.
- the operation that caused process A to fail is still
listed in the database, and when process A restarts,
it may attempt to re-try the same operation and fail
again.
- the operation that caused process A to fail may have
been mirrored to other systems, which will then
experience or cause additional failures.
For all of these reasons it is desirable to have the ability to
escalate to progressively more complete restarts of a progressively
wider range of components.
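Such an escalation might be driven by a very simple loop: try the
cheapest recovery first, and move on to a wider or deeper restart only
if the previous attempt did not restore service. In the sketch below,
the recovery actions and the health check are placeholders for whatever
the system actually provides:

    def recover(actions, is_healthy):
        """actions: ordered list of (description, recovery_function) pairs,
           cheapest first (e.g. warm-start a process, cold-start it, restart
           the process group, restart the node, reboot and cold-start all).
           is_healthy: function that checks whether the service is working again."""
        for description, action in actions:
            print("attempting recovery:", description)
            action()
            if is_healthy():
                return True     # stop escalating as soon as service is restored
        return False            # even the widest restart did not help; alert an operator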
False Reports
Ideally a problem will be found by the internal monitoring agent on the
affected node, which will automatically trigger a restart of the
affected software on that node. Such prompt local action has the
potential to fix the problem before other nodes even notice that
there was a problem.
But suppose a central monitoring service notes that it has not received
a heart-beat from process A. What might this mean?
- It might mean that process A's node has failed.
- It might mean that process A has failed.
- It might mean that process A's system is heavily loaded and the
heart-beat message was merely delayed.
- It might mean that a network error prevented or delayed the delivery
of a heart-beat message.
- It might mean there is a problem with the central monitoring service.
Declaring a process to have failed can potentially be a very expensive operation.
It might cause the cancellation and retransmission of all requests that had been
sent to the failed process or node. It might cause other servers to start trying
to recover work-in-progress from the failed process or node. This recovery
might involve a great deal of network traffic and system activity. We don't
want to start an expensive fire-drill unless we are pretty certain that a process
has actually failed.
- the best option would be for a failing system to detect its own
problem, inform its partners, and shut down cleanly.
- if the failure is detected by a missing heart-beat, it may be
wise to wait until multiple heart-beat messages have been missed
before declaring the process to have failed.
- to distinguish a problem with a monitored system from a problem
in the monitoring infrastructure, we might want to wait for multiple
other processes/nodes to notice and report the problem.
There is a trade-off here:
- If we do not take the time to confirm suspected failures, we may
suffer needless service disruptions by forcing
fail-overs away from servers that were actually healthy.
- If we mis-diagnose the cause of the problem and restart the
wrong components we may make the problem even worse.
- If we wait too long before initiating fail-overs, we are prolonging
the service outage.
These so-called "mark-out thresholds" often require a great deal of tuning.
Many systems evolve complex decision algorithms to filter and reconcile potentially
conflicting reports, attempting to infer the most likely cause of
a problem and the most effective means of dealing with it.
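A central monitor might implement such a mark-out threshold with
something like the sketch below, which declares a failure only after
several heart-beats have been missed and several independent complaints
have been received. The particular thresholds are arbitrary examples:

    import time

    HEARTBEAT_INTERVAL = 5       # seconds between expected heart-beats (assumed)
    MISSED_BEATS_ALLOWED = 3     # do not mark a server out on a single late beat
    REPORTS_REQUIRED = 2         # independent complaints needed to confirm a failure

    class MarkOutTracker:
        """Combine heart-beats and client failure reports before declaring failure."""

        def __init__(self):
            self.last_beat = {}      # server -> time of last heart-beat
            self.reports = {}        # server -> set of nodes reporting it down

        def heartbeat(self, server):
            self.last_beat[server] = time.time()
            self.reports[server] = set()          # earlier complaints are now stale

        def report_failure(self, server, reporter):
            self.reports.setdefault(server, set()).add(reporter)

        def is_failed(self, server):
            silent_for = time.time() - self.last_beat.get(server, 0)
            return (silent_for > MISSED_BEATS_ALLOWED * HEARTBEAT_INTERVAL and
                    len(self.reports.get(server, set())) >= REPORTS_REQUIRED)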
Other Managed Restarts
As we consider failure and restart, there are two other interesting types
of restart to note:
- non-disruptive rolling upgrades
If a system is capable of operating without some of its nodes, it
is possible to achieve non-disruptive rolling software upgrades.
We take nodes down, one-at-a-time, upgrade each to a new software
release, and then reintegrate them into the service. There are
two tricks associated with this:
- the new software must be upward compatible with
the old software, so that new nodes can interoperate
with old ones.
- if the rolling upgrade does not seem to be working,
there needs to be an automatic fall-back option
to return to the previous (working) release.
- prophylactic reboots
It has long been observed that many software systems become
slower and more error prone the longer they run. The most
common problem is memory leaks, but there are other types of
bugs that can cause software systems to degrade over time.
The right solution is probably to find and fix the bugs ...
but many organizations seem unable to do this. One popular
alternative is to automatically restart every system at
a regular interval (e.g. a few hours or days).
If a system can continue operating in the face of node failures,
it should be fairly simple to shut down and restart nodes
one at a time, on a regular schedule.
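Both kinds of restart could be orchestrated by the same node-by-node
loop, sketched below. The take_down, bring_up, and is_healthy
operations are placeholders for whatever the deployment system actually
provides, and the fall-back branch is the automatic return to the
previous release described above:

    def rolling_restart(nodes, take_down, bring_up, is_healthy, new_release=None):
        """Restart (and optionally upgrade) nodes one at a time.  If an upgraded
           node does not come back healthy, fall back to its old release and stop."""
        for node in nodes:
            old_release = take_down(node)           # remove the node from service
            bring_up(node, new_release or old_release)
            if not is_healthy(node):
                bring_up(node, old_release)         # automatic fall-back
                return False                        # abandon the rolling upgrade
        return True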