Health Monitoring and Recovery

Introduction

Suppose that a system seems to have locked up ... it has not recently made any progress. How would we determine whether the system is deadlocked?

How would we determine that the system might be wedged, so that we could invoke deadlock analysis?

It may not be possible to identify the owners of all of the involved resources, or even all of the resources.

Worse still, a process may not actually be blocked, but merely waiting for a message or event (that has, for some reason, not yet been sent).

If we did determine that a deadlock existed, what would we do? Kill a random process? This might break the circular dependency, but would the system continue to function properly after such an action?

Formal deadlock detection in real systems ...
  1. is difficult to perform
  2. is inadequate to diagnose most hangs
  3. does not enable us to fix the problem

Fortunately there are better techniques that are far more effective at detecting, diagnosing, and repairing a much wider range of problems: health monitoring and managed recovery.

Health Monitoring

We said that we could invoke deadlock detection whenever we thought that the system might not be making progress. How could we know whether or not the system was making progress? There are many ways to do this:

Any of these techniques could alert us to a potential deadlock, livelock, loop, or a wide range of other failures. But each of these techniques has different strengths and weaknesses:

Many systems use a combination of these methods:
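
As a concrete illustration, the sketch below shows one of the simplest external approaches: each monitored process periodically reports a heartbeat, and a monitor flags any process that has gone silent for longer than a timeout. The class and function names are invented for this example, not taken from any particular monitoring framework.

    import threading
    import time

    class HeartbeatMonitor:
        """Track the last heartbeat seen from each monitored process."""

        def __init__(self, timeout_seconds):
            self.timeout = timeout_seconds
            self.last_seen = {}            # process name -> time of last report
            self.lock = threading.Lock()

        def heartbeat(self, name):
            # Called (directly, or on receipt of a message) by each
            # monitored process whenever it makes progress.
            with self.lock:
                self.last_seen[name] = time.monotonic()

        def suspects(self):
            # Return the processes that have not reported within the timeout.
            now = time.monotonic()
            with self.lock:
                return [name for name, t in self.last_seen.items()
                        if now - t > self.timeout]

    if __name__ == "__main__":
        monitor = HeartbeatMonitor(timeout_seconds=2.0)
        monitor.heartbeat("worker-1")
        monitor.heartbeat("worker-2")
        time.sleep(3)                      # worker-2 goes silent
        monitor.heartbeat("worker-1")
        print("suspected failures:", monitor.suspects())   # ['worker-2']
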

Managed Recovery

Suppose that some or all of these monitoring mechanisms determine that a service has hung or failed. What can we do about it? Highly available services must be designed for restart, recovery, and fail-over:

Designing software in this way gives us the opportunity to begin with minimal disruption, restarting only the process that seems to have failed. In most cases this will solve the problem, but perhaps:

For all of these reasons it is desirable to be able to escalate to progressively more complete restarts of a progressively wider range of components.
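
The sketch below illustrates that escalation pattern in outline. The repair actions here are stubs standing in for whatever a real system would do (restart a daemon, reboot a node, fail work over to a replica); the point is simply to try the least disruptive repair first and widen the scope only if the problem persists.

    def escalating_recover(actions, healthy):
        # actions: (description, repair_function) pairs, ordered from least
        # to most disruptive; healthy: returns True once service is restored.
        for description, repair in actions:
            repair()
            if healthy():
                return description
        return "all recovery steps exhausted; alert an operator"

    if __name__ == "__main__":
        state = {"attempts": 0}

        def repair():                      # stub: pretend to restart something
            state["attempts"] += 1

        def healthy():                     # pretend the second, wider restart works
            return state["attempts"] >= 2

        print(escalating_recover(
            [("restarted the failed process", repair),
             ("restarted all processes on the node", repair),
             ("rebooted the node and failed its work over", repair)],
            healthy))                      # -> "restarted all processes on the node"
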

False Reports

Ideally a problem will be found by the internal monitoring agent on the affected node, which will automatically trigger a restart of the affected software on that node. Such prompt local action has the potential to fix the problem before other nodes even notice that there was a problem.

But suppose a central monitoring service notes that it has not received a heartbeat from process A. What might this mean?

Declaring a process to have failed can be a very expensive operation. It might force the cancellation and retransmission of every request that had been sent to the failed process or node, and it might cause other servers to begin recovering work-in-progress from that process or node; such recovery can involve a great deal of network traffic and system activity. We don't want to start an expensive fire drill unless we are fairly certain that the process has actually failed.

There is a trade-off here:

These so-called "mark-out thresholds" often require a great deal of tuning. Many systems evolve complex decision algorithms to filter and reconcile potentially conflicting reports, attempting to infer the most likely cause of a problem and the most effective means of dealing with it.
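
A minimal form of such a threshold is sketched below: a process is declared failed only after several consecutive missed reporting intervals, trading a little detection latency for far fewer expensive false alarms. The class name and threshold value are illustrative assumptions, not taken from any particular system.

    class MarkOutTracker:
        """Declare a process failed only after N consecutive missed reports."""

        def __init__(self, threshold=3):
            self.threshold = threshold     # consecutive misses before mark-out
            self.misses = {}               # process name -> current miss count

        def report(self, name, heartbeat_received):
            # Record one reporting interval; return True if the process
            # should now be marked out (declared failed).
            if heartbeat_received:
                self.misses[name] = 0
                return False
            self.misses[name] = self.misses.get(name, 0) + 1
            return self.misses[name] >= self.threshold

    if __name__ == "__main__":
        tracker = MarkOutTracker(threshold=3)
        for observed in [True, False, False, False]:
            if tracker.report("process-A", observed):
                print("process-A marked out; begin recovery")
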

Other Managed Restarts

As we consider failure and restart, there are two other interesting types of restart to note:

non-disruptive rolling upgrades

If a system is capable of operating without some of its nodes, it is possible to achieve non-disruptive rolling software upgrades. We take nodes down, one at a time, upgrade each to a new software release, and then reintegrate them into the service. There are two tricks associated with this:
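
Whatever those details look like in a particular system, the outer loop of a rolling upgrade is straightforward, as the sketch below suggests. The drain, upgrade, and health-check steps are stand-ins supplied by the caller; the loop upgrades one node at a time and stops immediately if an upgraded node fails its health check, so a bad release never takes out the whole fleet.

    import time

    def rolling_upgrade(nodes, drain, upgrade, healthy, settle_seconds=1.0):
        # Upgrade nodes one at a time, halting if an upgraded node fails
        # its health check on the new release.
        for node in nodes:
            drain(node)                    # move this node's work elsewhere
            upgrade(node)                  # install the new software release
            time.sleep(settle_seconds)     # give it time to come back up
            if not healthy(node):
                return "upgrade halted: " + node + " unhealthy on new release"
        return "all nodes upgraded"

    if __name__ == "__main__":
        print(rolling_upgrade(
            ["node-1", "node-2", "node-3"],
            drain=lambda n: print("draining", n),
            upgrade=lambda n: print("upgrading", n),
            healthy=lambda n: True,
            settle_seconds=0.0))
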

prophylactic reboots

It has long been observed that many software systems become slower and more error prone the longer they run. The most common problem is memory leaks, but there are other types of bugs that can cause software systems to degrade over time. The right solution is probably to find and fix the bugs ... but many organizations seem unable to do this. One popular alternative is to automatically restart every system at a regular interval (e.g. a few hours or days).

If a system can continue operating in the face of node failures, it should be fairly simple to shut down and restart nodes one at a time, on a regular schedule.
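
A sketch of such a schedule appears below: it cycles through the nodes, restarting one per interval, with the restart action left as a stub for whatever the platform actually uses (service restart, VM reboot, container replacement). The interval would normally be hours or days; the demo shortens it so the example runs instantly.

    import itertools
    import time

    def prophylactic_reboots(nodes, restart, interval_seconds, cycles):
        # Restart one node every interval_seconds, cycling through the list
        # so the rest of the service keeps running while each node is down.
        schedule = itertools.cycle(nodes)
        for _ in range(cycles):
            restart(next(schedule))        # e.g. drain, reboot, reintegrate
            time.sleep(interval_seconds)

    if __name__ == "__main__":
        # Demo: "restart" three nodes immediately rather than every few hours.
        prophylactic_reboots(
            ["node-1", "node-2", "node-3"],
            restart=lambda n: print("restarting", n),
            interval_seconds=0.0,
            cycles=3)
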