Tuesday, October 26, 2010

Listen to Your Applications

Applications have lots to say. Here's how I've learned to listen to them.

I have recently been involved in the development of a highly distributed cloud application. We were a small team and wanted to remain agile all the way through. We had extensive testing and continuous integration in place from day one, giving us plenty of feedback during development, the kind of feedback that is essential for building the right thing and building it well.

But what about production time?

We wanted to get feedback from this part of the application life cycle too, so we decided to build and configure many different feedback sources through which our application could speak to us.

And speak it did. It provided us with precious feedback on three very different aspects of itself: user experience, design and implementation.

I have tried to represent the different feedback sources we've baked into our application, and the domain each of them belongs to, on this lovely triangular diagram:

[Triangular diagram: feedback sources laid out between user experience, design and implementation]

Let me detail each feedback source:
  • Activity Log - This is a detailed audit trail of each and every user action you can capture. It provides detailed feedback on your features and on whether you've made them usable or not. Storing this data in a PostgreSQL partitioned table served us well; with higher volumes, you may want to go NoSQL (a minimal writer is sketched after this list).
  • Error Log - An embarrassing festival of stack traces that may or may not have a direct impact on the end user. Needless to say, this log is best kept empty. A service like Hoptoad can help you with that by putting errors in your face until you resolve them.
  • Trace Log - This is where you take the true measure of what your application is actually doing, which is less than obvious in highly distributed applications. Logging correlation IDs and aggregating logs in a central place via syslog or Scribe is a good approach. You'll also need search capabilities over these logs: think Clarity or Splunk, depending on your constraints and budget (a correlation-ID sketch follows the list).
  • Response Time - This is an obvious metric that will shed some light on your design and implementation. Just be sure you're logging it and paying attention to it (a timing decorator is sketched below).
  • DB TPS - Though outside the pure realm of your application feedback loop, this metric gives you a good measure of how database-intensive your application is, and whether it needs some redesign, for example low-hanging fruit where caching could help (a rough probe is sketched below).
  • Cache Hit/Miss - Caching brings as many problems as it solves: a cache-happy application doesn't come for free, especially if it is distributed. Measuring the hit/miss ratio on each cache can help validate its usefulness, or lack thereof (see the counting wrapper below).
  • MQ Throughput - Monitoring queues for high-watermark thresholds is commonly done outside of the application's realm. One interesting MQ-related piece of data an application can log is the time a message has been in flight, with or without the processing time of the message after it's been consumed (see the sketch below).
  • Activity Intensity - This last one is a fun one: by representing the number of active application sessions alongside the current database activity, you can get a great idea of how active (or bored) your users are.
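
To make a few of these concrete, here is a minimal sketch of an activity-log writer in Python, assuming a psycopg2 connection and a hypothetical activity_log table partitioned by date; the function and column names are illustrative, not our actual schema:

    import json
    from datetime import datetime, timezone

    def log_activity(conn, user_id, action, details=None):
        """Append one user action to the partitioned audit trail."""
        with conn.cursor() as cur:
            # PostgreSQL routes the row to the right date partition
            # (in the 8.x era, via a trigger or rule on the parent table).
            cur.execute(
                "INSERT INTO activity_log (occurred_at, user_id, action, details) "
                "VALUES (%s, %s, %s, %s)",
                (datetime.now(timezone.utc), user_id, action,
                 json.dumps(details or {})))
        conn.commit()

    # e.g. log_activity(conn, 42, "report.export", {"format": "csv"})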
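
For the trace log, the crux is stamping every line with a correlation ID that travels with the request across services. Here's a sketch using Python's standard logging module; the thread-local storage and the "-" fallback are my assumptions:

    import logging
    import threading
    import uuid

    _context = threading.local()

    class CorrelationFilter(logging.Filter):
        """Stamp every record with the current request's correlation ID."""
        def filter(self, record):
            record.correlation_id = getattr(_context, "correlation_id", "-")
            return True

    def start_request(incoming_id=None):
        # Reuse the ID an upstream service passed along, or mint a new one.
        _context.correlation_id = incoming_id or uuid.uuid4().hex

    logging.basicConfig(format="%(asctime)s [%(correlation_id)s] %(message)s")
    log = logging.getLogger("trace")
    log.addFilter(CorrelationFilter())

    start_request()
    log.warning("calling inventory service")  # same ID on every hop

Shipping these lines off the box, for instance with logging.handlers.SysLogHandler, gives the central aggregator an ID it can join on across machines.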
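
Logging response times can be as simple as a decorator around your entry points. This sketch logs one line per call; the logger name and log fields are illustrative:

    import functools
    import logging
    import time

    metrics = logging.getLogger("metrics")

    def timed(fn):
        """Log the wall-clock duration of every call to fn."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.time() - start) * 1000.0
                # One line per call; compute averages and percentiles downstream.
                metrics.info("response_time op=%s ms=%.1f", fn.__name__, elapsed_ms)
        return wrapper

    @timed
    def render_dashboard(user_id):
        ...  # your actual handler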
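
DB TPS can be derived from PostgreSQL's cumulative counters in pg_stat_database. This rough probe diffs them over a sampling window; the database name and interval are placeholders:

    import time

    def sample_tps(conn, datname="myapp", interval=10.0):
        """Approximate transactions per second over one sampling window."""
        def txn_count():
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT xact_commit + xact_rollback FROM pg_stat_database "
                    "WHERE datname = %s", (datname,))
                return cur.fetchone()[0]
        before = txn_count()
        time.sleep(interval)
        # The counters accumulate since the last stats reset,
        # so the delta over the window gives the rate.
        return (txn_count() - before) / interval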
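
Measuring hit/miss ratios requires nothing fancy: a counting wrapper around whatever get/set cache client you use (memcached, in-process, ...) will do. The interface below is an assumption, not any specific library's API:

    class CountingCache:
        """Delegates to a real cache while tallying hits and misses."""
        def __init__(self, backend):
            self.backend = backend
            self.hits = 0
            self.misses = 0

        def get(self, key):
            value = self.backend.get(key)
            if value is None:
                self.misses += 1  # treat absent keys as misses
            else:
                self.hits += 1
            return value

        def set(self, key, value):
            self.backend.set(key, value)

        def hit_ratio(self):
            total = self.hits + self.misses
            return self.hits / total if total else 0.0

Logging hit_ratio() per cache at regular intervals is what lets you validate each cache, or retire it.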
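
Finally, for the in-flight time of messages, the producer can stamp each message on its way into the queue and the consumer can log the delta, with or without processing time. The JSON envelope is an assumption about the message format, and the numbers are only as good as your clock synchronization across machines:

    import json
    import logging
    import time

    metrics = logging.getLogger("metrics")

    def wrap_for_queue(payload):
        # Producer side: record when the message entered the queue.
        return json.dumps({"enqueued_at": time.time(), "payload": payload})

    def handle(raw_message, process):
        # Consumer side: measure queue latency, then processing time.
        msg = json.loads(raw_message)
        in_flight = time.time() - msg["enqueued_at"]
        start = time.time()
        process(msg["payload"])
        metrics.info("mq in_flight=%.3fs with_processing=%.3fs",
                     in_flight, in_flight + (time.time() - start))
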
Let me mention a single benefit of this approach: thanks to the detailed activity log, we've been able to spot design issues that were preventing users from making full use of some features. And we've been able to fix these issues based not on assumptions or wild guesses but on measured data.

Your applications want to talk to you: do you listen to them? How do you do it?