Wednesday, April 02, 2008

Healthy Health Checks?

Here is a little story that happened to a friend of mine a few years ago. Users of his web application started to complain that the system was broken and no longer responding. The curious thing was that the operations team was not aware of any issue. After checking with them, it turned out that the monitor they had in place for this web application was simply checking whether an HTTP response was received. Any response. Even a 500!
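To make the flaw concrete, here is a minimal sketch of a probe that looks at the status code instead of accepting any response. The function names `status_is_healthy` and `probe`, the 2xx-only rule and the two-second timeout are my own assumptions for illustration, not what that operations team ran.

```python
import urllib.error
import urllib.request


def status_is_healthy(status: int) -> bool:
    """Treat only 2xx responses as healthy; a 500 must count as a failure."""
    return 200 <= status < 300


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return status_is_healthy(response.status)
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx: a response arrived, but it is not a healthy one.
        return status_is_healthy(error.code)
    except (urllib.error.URLError, OSError):
        # No usable response at all: connection refused, DNS failure, timeout.
        return False
```

A monitor built this way would at least have flagged the 500 responses, although it still says nothing about how deep the check behind the URL goes.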

This sounds naive and ridiculous but setting up application monitoring is a subject that is a little more hairy than it appears at first glance. Consider another more recent case that came to my attention: in this case, the application was still replying positively to its health check monitor but was not functioning properly, as it was unable to access required file system resources. Again, the end users were affected while the monitoring was happily receiving correct responses from the application.

So how can we, software developers, create health checks that operations can rely on?

Taking the canonical multi-tiered web application as an example, the following schema shows a health check that is too shallow to be useful (in red) and one that exercises the full depth of the layers (in green).
While it is clear that the shallow approach brings little value as far as end-user quality of service is concerned, why don't we always shoot for the deep approach?

Well, if you consider how a serious load balancer appliance (like BIG-IP) works, you will realize that it performs health checks very regularly (by default every 5 seconds) in order to have the most up-to-date view of the sanity of the members of the pools it handles. Bearing this in mind, if a health check request exercised the full depth of an application, you would add a permanent load to your system, which would increase the strain on your various resources, down to the database itself. With a farm of n servers, the cumulative strain induced by the health check requests on all its members would start to be non-negligible on any shared resource.
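A quick back-of-the-envelope calculation shows how this adds up. The sketch below assumes every health check goes all the way down to a shared resource; the farm size of 20 servers and the second monitor are made-up figures, and only the 5-second interval comes from the text above.

```python
def deep_checks_per_second(servers: int, interval_seconds: float, monitors: int = 1) -> float:
    """Rate of health-check-induced requests reaching a shared resource.

    Assumes each monitor polls every server once per interval and that each
    poll exercises the shared resource exactly once.
    """
    return servers * monitors / interval_seconds


# Hypothetical farm: 20 servers polled every 5 seconds by a single monitor.
single_monitor = deep_checks_per_second(servers=20, interval_seconds=5)

# Same farm polled by two monitors (for instance a redundant pair).
two_monitors = deep_checks_per_second(servers=20, interval_seconds=5, monitors=2)

# Requests per day hitting the shared resource with a single monitor.
per_day = single_monitor * 24 * 60 * 60
```

With these assumed figures, the shared database would serve 4 extra requests per second (8 with two monitors), or 345,600 per day, purely to answer monitoring.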

My take on this would be the following: create an internal watchdog that evaluates the sanity of the application at a reasonable pace, and have the application report the current state of this watchdog whenever a monitor requests a health check.

As shown in the above schema, the watchdog life cycle is decoupled from the health check one, which reduces the strain on the underlying resources while still letting the monitoring environment become aware of an application issue almost as soon as the application notices it itself (because the monitor polling frequency is kept high).
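The watchdog idea could be sketched roughly as follows. Everything here is an assumption made for illustration: the `Watchdog` class, its method names, the 30-second pace and the choice of 200/503 as status codes. The deep checks themselves (database, file system and so on) are passed in as plain callables.

```python
import threading
from typing import Callable, Dict, List


class Watchdog:
    """Runs deep checks at its own pace and caches the outcome for cheap reads."""

    def __init__(self, checks: Dict[str, Callable[[], bool]], interval_seconds: float = 30.0):
        self._checks = checks
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._failures: List[str] = ["not evaluated yet"]  # pessimistic until the first run
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def evaluate_once(self) -> List[str]:
        """Run every deep check once and store the names of the failing ones."""
        failures = []
        for name, check in self._checks.items():
            try:
                ok = bool(check())
            except Exception:
                ok = False  # a check that blows up counts as a failure
            if not ok:
                failures.append(name)
        with self._lock:
            self._failures = failures
        return failures

    def _run(self) -> None:
        while not self._stop.is_set():
            self.evaluate_once()
            self._stop.wait(self._interval)  # sleep, but wake up early on stop()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()

    def status_code(self) -> int:
        """What the health check URL would answer: no deep work, just the cached state."""
        with self._lock:
            return 200 if not self._failures else 503
```

The health check handler of the application would only call `status_code()`, so the load balancer can keep polling every few seconds while the database and the file system are only exercised once per watchdog interval. The price to pay is that a failure can go unnoticed for up to one interval, so the pace is a trade-off between detection delay and added load.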

What is your own experience in this field and what is the path you have followed in order to build dependable health checks?