Wednesday, April 02, 2008

Healthy Health Checks?

Here is a little story that happened to a friend of mine a few years ago. Users of his web application started to complain that the system was broken and no longer responding. The curious thing was that the operations team was not aware of any issue. After checking with them, it turned out that the monitor they had in place for this web application was simply checking whether an HTTP response was received. Any response. Even a 500!

This sounds naive and ridiculous, but setting up application monitoring is a little hairier than it appears at first glance. Consider another, more recent case that came to my attention: the application was still replying positively to its health check monitor but was not functioning properly, as it was unable to access required file system resources. Again, the end users were affected while the monitoring was happily receiving correct responses from the application.

So how can we, as software developers, create health checks that operations can rely on?

Taking the canonical multi-tiered web application as an example, the following schema shows a health check that is too shallow to be useful (in red) and one that exercises the full layer depth (in green).
While it is clear that the shallow approach brings little value as far as end user quality of service is concerned, why don't we always shoot for the deep approach then?
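To make the question concrete, here is a minimal sketch of what a deep check could look like. It assumes a servlet container, a JNDI-bound DataSource named jdbc/appDS and a /var/app/data directory the application depends on; these names are hypothetical and only serve to illustrate the idea.

import java.io.File;
import java.sql.Connection;
import javax.naming.InitialContext;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

// A "deep" health check: every probe from the monitor goes all the way down
// to the database and the file system before replying.
public class DeepHealthCheckServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        try {
            // Exercise the data tier with a trivial query.
            DataSource ds = (DataSource)
                new InitialContext().lookup("java:comp/env/jdbc/appDS");
            try (Connection c = ds.getConnection()) {
                c.createStatement().execute("SELECT 1");
            }
            // Check the file system resources the application needs.
            if (!new File("/var/app/data").canRead()) {
                throw new IllegalStateException("data directory unreadable");
            }
            resp.setStatus(HttpServletResponse.SC_OK);
        } catch (Exception e) {
            // Any failure surfaces as a 500, so the monitor pulls the node out of the pool.
            resp.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
        }
    }
}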

Well, if you consider how a serious load balancer appliance (like BIG-IP) works, you will realize that it performs health checks very regularly (by default every 5 seconds) in order to have the most up-to-date view of the sanity of the members of the pools it handles. Bearing this in mind, if a health check request exercised the full depth of the application, it would add a permanent load to your system, increasing the strain on your various resources, down to the database itself. With a farm of n servers, the cumulated strain induced by the health check requests on all its members starts to be non-negligible on any shared resource: a pool of 20 servers each probed every 5 seconds would, for example, fire 4 extra queries per second at the database, around the clock.

My take on this would be the following: create an internal watchdog that evaluates the sanity of the application at a reasonable pace, and report the current state of this watchdog whenever a monitor requests a health check from the application.

As shown in the above schema, the watchdog life cycle is decoupled from the health check one, which reduces the strain on the underlying resources while letting the monitoring environment become aware of an application issue almost as soon as the application detects it itself (because the monitor polling frequency can be kept high).
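Here is a minimal sketch of that watchdog approach, reusing the same hypothetical DataSource and data directory as the deep check above: a background task probes the expensive resources at its own pace and caches the verdict, while the health check endpoint only reports the cached state, so the monitor can poll as frequently as it likes without adding load on the back end.

import java.io.File;
import java.sql.Connection;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.naming.InitialContext;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

public class WatchdogHealthCheckServlet extends HttpServlet {
    private volatile boolean healthy = false;
    private ScheduledExecutorService watchdog;

    @Override
    public void init() {
        // Evaluate the sanity of the application at a "reasonable pace" (here every 30 seconds).
        watchdog = Executors.newSingleThreadScheduledExecutor();
        watchdog.scheduleWithFixedDelay(new Runnable() {
            public void run() {
                healthy = probe();
            }
        }, 0, 30, TimeUnit.SECONDS);
    }

    // The deep probe: database connectivity and required file system resources.
    private boolean probe() {
        try {
            DataSource ds = (DataSource)
                new InitialContext().lookup("java:comp/env/jdbc/appDS");
            try (Connection c = ds.getConnection()) {
                c.createStatement().execute("SELECT 1");
            }
            return new File("/var/app/data").canRead();
        } catch (Exception e) {
            return false;
        }
    }

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        // Only report the cached watchdog state: no load reaches the database or disk here.
        resp.setStatus(healthy ? HttpServletResponse.SC_OK
                               : HttpServletResponse.SC_SERVICE_UNAVAILABLE);
    }

    @Override
    public void destroy() {
        watchdog.shutdownNow();
    }
}

Returning a 503 rather than a 500 when the watchdog is unhappy is a detail, but it tells the load balancer that the node is explicitly asking to be taken out of rotation rather than simply crashing.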

What is your own experience in this field, and what path have you followed to build dependable health checks?

1 comment:

Perrine said...

A monitoring policy is not an easy thing to manage.

For example, you may check the availability of the components of your application: database, application servers, web servers, proxies and so on.
This is the system check.

You may also check the application itself with some scenarios. This is functional monitoring.

The third aspect is the load generated by the monitoring itself (network, CPU, I/O, logging, ...).

The approach I prefer, on the system side, is a simple high-frequency local check, plus a more powerful check if something goes wrong.

On the application side, it is more difficult, as the development team must be involved to make the application able to check itself and raise alerts.
Components must watch the other components they depend on, and some degraded-mode handling must be planned.
For example, a lost connection implies that requests can't be fulfilled.
It is simply error management connected to the monitoring process.

Databases are a good example, with statistics collection and analysis to detect and anticipate malfunctions.

Then a well-documented recovery policy must indicate how to restore the service.

Simple to say, very difficult to implement, as monitoring is event and message management with no standard at all.