Tuesday, October 26, 2010

Listen to Your Applications

Applications have lots to say. Here's how I've learned to listen to them.

I have recently been involved in the development of a highly distributed cloud application. We were a small team and wanted to remain agile all the way through. We had extensive testing and continuous integration in place from day one, giving us plenty of feedback during development: feedback that is essential for building the right thing and building it well.

But what about production time?

We wanted feedback from this part of the application life cycle too. We therefore decided to build and configure many different feedback sources so our application could speak to us.

And speak it did. It actually provided us with precious feedback on three very different aspects of itself: user experience, design and implementation.

I have tried to represent the different feedback sources we've baked into our application, and the domain each belongs to, on this lovely triangular diagram:

Let me detail each feedback source:
  • Activity Log - This is a detailed audit trail of each and every user action you can capture. It provides detailed feedback on your features and how usable you've made them. Storing this data in a PostgreSQL partitioned table worked well for us. With higher volumes, you may want to go NoSQL.
  • Error Log - An embarrassing stack festival that may or may not have direct impact on the end user. No need to mention that this log is best kept empty. A service like Hoptoad can help you with that by putting errors in your face until you resolve them.
  • Trace Log - This is where you take the true measure of what your application is actually doing, which is less than obvious in highly distributed applications. Logging correlation IDs and aggregating logs in a central place via syslog or Scribe is a good approach. You'll need search capabilities in these logs: think Clarity or Splunk, depending on your constraints and budget.
  • Response Time - This is an obvious metric that will shed some light on your design and implementation. Just be sure you're logging it and paying attention to it.
  • DB TPS - Though outside the pure realm of your application feedback loop, this metric gives you a good measure of how DB-intensive your application is and whether it needs some redesign, for example low-hanging fruit where caching could help.
  • Cache Hit/Miss - Caching brings as many problems as it solves: a cache-happy application doesn't come for free, especially if it is distributed. Measuring the hit/miss ratio on each cache can help validate their usefulness or lack thereof.
  • MQ Throughput - Monitoring of queues for high watermark thresholds is commonly done outside of the application's realm. An interesting piece of MQ-related data an application can log is the time a message has been in flight, with or without the processing time of the message after it's been consumed.
  • Activity Intensity - This last one is a fun one: by representing the number of active application sessions and the current database activity, you can get a great idea of how active (or bored) your users are.
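As an illustration of the Cache Hit/Miss source above, here is a minimal sketch of a cache that counts its own hits and misses. It is written in Python for brevity and is not the code we ran in production: the MeasuredCache class, its loader argument and the "user-profiles" name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MeasuredCache:
    """Dictionary-backed cache that counts hits and misses (illustrative only)."""
    name: str
    hits: int = 0
    misses: int = 0
    _entries: dict = field(default_factory=dict)

    def get(self, key, loader):
        # A hit is served from memory; a miss calls the loader and stores the result.
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = loader(key)
        self._entries[key] = value
        return value

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = MeasuredCache("user-profiles")
for user_id in [1, 2, 1, 1, 3]:
    cache.get(user_id, loader=lambda k: {"id": k})

# Log this per cache, periodically, next to the other metrics.
print(f"{cache.name}: hit ratio {cache.hit_ratio():.0%}")
```

A cache whose ratio stays stubbornly low is a good candidate for removal.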
Let me mention a single benefit of this approach: thanks to the detailed activity log, we've been able to spot design issues that were preventing users from making full use of some features. And we've been able to fix these issues based not on assumptions or wild guesses but on measured data.

Your applications want to talk to you: do you listen to them? How do you do it?

Wednesday, September 15, 2010

DevOps: Time for Agile Operations!

I've made a little xtranormal movie to introduce DevOps on the blog of AgilePartner.

Go check it out!

Friday, September 10, 2010

Erlang + Cloud Files = cferl

To celebrate the return of CloudCamp in Vancouver, I'm happy to announce the very first release of cferl, the Erlang API for Rackspace Cloud Files.

cferl fully implements the current version of the Cloud Files API. With it, you can very easily create and manage storage containers and the data objects they contain. You also have full control over the publication of your data objects on Rackspace's CDN.

Here is a short example that demonstrates a few operations using cferl:

That's all it takes to create a container, add an XML document into it and make it available to the rest of the world over CDN!

To probe further, you can:
Enjoy cferl!

Sunday, September 05, 2010

Recently Reviewed: Patterns-Based Engineering

From time to time, I participate in technical book reviews. Here is my account for a book I've recently reviewed.

Pattern galore: a term that aptly describes one of the worst nightmares of software craftsmen. No-one wants to come close to a system that has been dragged into a design hell created by pattern-overenthusiastic programmers. A craftsman wants his design to be constrained by requirements and his code to be written by hand.

Alas not all code is born equal. Some languages and platforms insist on the creation of tedious scaffolding code before starting to tackle the real meat of the problem. And large projects imply the repetition of this code ad nauseam. Moreover, many have been lost in the quest for the long sought holy grail of code re-use.

Not deterred by this difficult context and heavy history, Ackerman and Gonzalez have decided to present a pragmatic approach to using patterns in software engineering. And they did great. In truth, Patterns-Based Engineering is faithful to its tagline: Successfully Delivering Solutions via Patterns.

Rest assured that snake-oil is not in the catalog of the authors: the book is as concrete as possible, organized as a manual for kick-starting a rational approach to using patterns. The authors took time to debunk pattern-related misconceptions, and this for a reason: there is a lot to do to polish the image of design patterns. I believe this book is an essential first step in the right direction.

Monday, May 24, 2010

Data Interaction Patterns

Throughout my experience working on back-end systems for anything from big government to online gaming, I have come to develop a particular appreciation of the interactions that happen between data consumers and data producers. The following is a non-exhaustive and non-authoritative review of the different data interaction patterns that I've come to play with. These are mostly unstructured notes from my experience in the field that I hope may prove useful to others.

As you know, when data is involved, caching comes into play as soon as performance and scalability are sought. In the coming diagrams, cache is represented as a vertical rectangle. The persistent storage is represented as a vertical blue cylinder, while horizontal cylinders represent some form of reliable and asynchronous message delivery channels. The data interactions are represented with curvy arrows: they can represent reading or writing.

Direct [R/W]

Besides the obvious drawbacks coming from the temporal coupling with the persistent storage mechanism, the interesting thing to note in such a trivial data access pattern is that there is often some form of request-scoped caching happening without the need to explicitly do anything. This first level of cache you get from data access layers helps in optimizing operations provided they occur in the same request (to which the transaction is bound, if one exists).

Being short lived, this kind of caching is free from the problem of expired cache entries eviction: it can kick in transparently without the application being aware of it.
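To make the idea concrete, here is a minimal Python sketch of a request-scoped cache. Data access layers usually provide this for you; the RequestSession class and the dictionary standing in for the persistent storage are hypothetical names used only for illustration.

```python
class RequestSession:
    """Per-request identity map: the same key is fetched from storage only once."""

    def __init__(self, storage):
        self._storage = storage   # hypothetical storage: any mapping-like object
        self._loaded = {}         # lives only as long as this request
        self.storage_reads = 0

    def find(self, key):
        if key not in self._loaded:
            self.storage_reads += 1
            self._loaded[key] = self._storage[key]
        return self._loaded[key]


storage = {"order-42": {"total": 99}}


def handle_request():
    session = RequestSession(storage)   # created per request, discarded afterwards
    first = session.find("order-42")
    second = session.find("order-42")   # served from the request-scoped cache
    return first is second, session.storage_reads


same_object, reads = handle_request()
```

Because the session dies with the request, nothing ever needs to be evicted from it.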

Through Cache [R/W]

Reading through cache is a simple and powerful mechanism where an application tries first to read from a long lived cache (a very cheap operation) and, if the requested data can't be found, proceeds with a read in the persistent storage (a way more expensive operation).

It's interesting to note that write operations don't necessarily happen the same way, i.e. it is quite possible that a write to the persistent storage doesn't perform a similar write in the cache. Why is that? Cached data is often a specific representation of the data available in the storage: it can be for example an aggregation of different data points that correspond to a particular cache key. The same persistent data can lead to the creation of several different cache entries. In that case, a write can simply lead to an immediate cache flush, waiting for subsequent read operations to repopulate these entries with new data.
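A minimal Python sketch of reading through cache with a flush on write could look like the following. Plain dictionaries stand in for both the long-lived cache and the persistent storage, and the ReadThroughRepository class and its method names are hypothetical.

```python
class ReadThroughRepository:
    def __init__(self, storage):
        self._storage = storage   # stand-in for the persistent storage
        self._cache = {}          # stand-in for a long-lived cache
        self.storage_reads = 0

    def total_for_customer(self, customer_id):
        # The cached entry is an aggregation, not a copy of the stored rows.
        key = ("total", customer_id)
        if key in self._cache:
            return self._cache[key]
        self.storage_reads += 1
        total = sum(self._storage.get(customer_id, []))
        self._cache[key] = total
        return total

    def add_order(self, customer_id, amount):
        # Write to storage only, then flush the derived cache entry.
        self._storage.setdefault(customer_id, []).append(amount)
        self._cache.pop(("total", customer_id), None)


repo = ReadThroughRepository({"alice": [10, 20]})
first = repo.total_for_customer("alice")    # miss: read from storage
second = repo.total_for_customer("alice")   # hit: served from cache
repo.add_order("alice", 5)                  # flushes the cached total
third = repo.total_for_customer("alice")    # repopulated by the next read
```

The write path never computes the aggregation itself: it only invalidates, and the next reader pays for the recomputation.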

Conversely, it's possible to have write operations update the cache, which opens the interesting problem of consistency. In the current scenario, the persistent storage remains the absolute truth of consistency: the application must handle the case when the cache was inconsistent and led to an invalid data operation in the persistent storage. I've found that localized cache evictions work well: the system goes through a little hiccup but quickly restores its data sanity.

Though some data access technologies allow the automatic management of this kind of second level of caching, I personally prefer that my applications have an explicit interaction with the caching technology they use, and this at the service layer. This is especially true when considering distributed caching and the need to address the inherent idiosyncrasies of such a caching model.

Cache distribution or clustering is not compulsory though: you can reap the benefits of reading through cache with localized caches but at the expense of needing to establish some form of stickiness between the data consumers and the providers (for example, by keeping a user sticky to a particular server based on its IP or session ID).

This said, stickiness skews load balancing and doesn't play well when you alter a pool of servers: I've really become convinced that you get better applications by preventing stickiness and letting requests hit any server. In that case, cache distribution or clustering becomes necessary: the former presents some challenges (like getting stale data after a repartition of the caching continuum) but scales better than the latter.

Write Behind [W]

Writing behind consists in updating the data cache synchronously and then deferring the write to the persistent storage to an asynchronous process, through a reliable messaging channel.

This is possible with regular caching technologies if there are no strong integrity constraints or if it's acceptable to present temporarily wrong data to the data consumer. In case the application has strong integrity constraints, the caching technology must be able to become the primary source of integrity truth: a consistent distributed cache that supports some form of transactional data manipulation becomes necessary.

In this scenario, the persistent storage doesn't enforce any form of data constraint, mostly because it is too hard to propagate violation issues back to the upstream layers in any meaningful form. One could wonder what the point is of using such a persistent storage if it is dumbed down to such a mundane role: if this storage is an RDBMS, there is still value in writing to it because external systems like a back-office or business intelligence tools often require access to a standard data store.
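Here is a minimal Python sketch of the write-behind flow. Be aware of the simplifications: an in-memory queue.Queue stands in for the reliable messaging channel (it is not reliable at all), dictionaries stand in for the cache and the storage, and all names are hypothetical.

```python
import queue
import threading

cache = {}
storage = {}
pending_writes = queue.Queue()   # stand-in for a reliable messaging channel


def write_behind(key, value):
    cache[key] = value                 # synchronous: consumers see the value at once
    pending_writes.put((key, value))   # the storage write is deferred


def storage_writer():
    # Asynchronous process that drains the channel into the persistent storage.
    while True:
        item = pending_writes.get()
        if item is None:               # sentinel: stop the worker
            pending_writes.task_done()
            break
        key, value = item
        storage[key] = value
        pending_writes.task_done()


write_behind("score:alice", 1200)
visible_in_cache = cache["score:alice"]
stored_before_flush = "score:alice" in storage   # the writer has not run yet

worker = threading.Thread(target=storage_writer, daemon=True)
worker.start()
pending_writes.put(None)
pending_writes.join()   # wait until every queued write has been applied
```

Between write_behind returning and the writer catching up, the cache is the only place where the new value exists, which is why it has to be trusted as the source of truth.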

Cache Push [R]

Pushing to cache is very useful for data whose lifecycle is not related to the interactions with its consumers. This is valid for feeds or the result of expensive computations not triggered by client requests.

The mechanism that pushes to cache can be something like a scheduled task or a process consuming asynchronous message channels.
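In its simplest form, this could be sketched in Python as follows. The fetch_latest_headlines producer and the feed_cache dictionary are hypothetical, and the scheduler itself is left out: the push function is simply called once by hand.

```python
import time

feed_cache = {}   # stand-in for the cache shared with the consumers


def fetch_latest_headlines():
    # Hypothetical producer: stands in for a feed poll or an expensive computation.
    return ["headline A", "headline B"]


def push_to_cache():
    # Called by a scheduled task or a message consumer, never by a client request.
    feed_cache["headlines"] = {
        "items": fetch_latest_headlines(),
        "refreshed_at": time.time(),
    }


def read_headlines():
    # Consumers only ever read the cache; they never trigger the computation.
    entry = feed_cache.get("headlines")
    return entry["items"] if entry else []


before = read_headlines()
push_to_cache()
after = read_headlines()
```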

Future Read [R]

In this scenario, the data producer synchronously answers the consumer with the promise of the future delivery of the requested data. When available, this data is delivered to the client via some sort of server push mechanism (see next section).

This approach works very well for expensive computations triggered by client requests.
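A minimal Python sketch of this interaction, using the standard concurrent.futures module, is shown below. The delivered list stands in for a real server push channel, and expensive_report and request_report are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

delivered = []   # stand-in for a server push channel to the client


def expensive_report(customer_id):
    # Hypothetical expensive computation triggered by a client request.
    return {"customer": customer_id, "total": sum(range(1000))}


def request_report(executor, customer_id):
    future = executor.submit(expensive_report, customer_id)
    # When the result is ready, push it to the client.
    future.add_done_callback(lambda f: delivered.append(f.result()))
    # The synchronous answer is only a promise of future delivery.
    return {"status": "accepted", "customer": customer_id}


with ThreadPoolExecutor(max_workers=1) as executor:
    ack = request_report(executor, "alice")
# Leaving the with-block waits for the pending computation to finish.
```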

Server Push [R]

Server push can be used to complement any of the previous interactions: in that case, a process prepares some data and delivers it directly to the consumer. There are many well known technological approaches for this, including HTTP long-polling, AJAX/CometD, web sockets or AMQP. Enabling server push in an application opens the door to very interesting data interactions as it allows decoupling the activities of the data consumers and producers.

Monday, May 17, 2010

Infected but not driven

The least I can say is that I'm test infected: when a coverage report shows lines of code that are not exercised by any test, I can't help but freak out a little (unless it appears that this code is truly useless and can be mercilessly pruned). This quasi obsession for testing is not vain at all: time and again I have experienced the quality, stability and freedom of movement a high test coverage gives me. Things work, regressions are rare and refactoring is bliss thanks to the safety net tests provide.

So what's with TDD... and me?

There are some interesting discussions going on around TDD and its applicability, which I think are mostly fueled by the heavy insistence of TDD advocates on their particular way of approaching software development in general and testing in particular. The more time I spend thinking about these discussions, the more it becomes clear to me that as far as testing is concerned, the usual rule of precaution of our industry applies, i.e. it depends.

To be frank, I'm having a hard time with the middle D in TDD: as I said, I'm test infected, low test coverage gives me the creeps, but my process of building software is not driven by tests. From an external viewpoint, it is driven by features so that would make it FDD. From my personal viewpoint, it is driven by gratification, which makes it GDD.

Being gratified when writing software is what has driven me since I was a kid: I didn't spend countless hours hurting my fingers on a flat and painful ZX-81 keyboard for the sake of it. I did it to see my programs turned into tangible actions on the computer. It was gratifying. And this is what I'm still looking for when writing software.

But let's go back to the main point of this discussion: TDD. With all the industry notables weighing in heavily on writing code while being driven by tests, should I conclude there's something wrong with my practice? Or is the insistence on test first just a way to have developers write tests at all?

Adding features to a system, at least for the kind of systems I'm working on, mainly consists in implementing a behavior and exposing it through some sort of public interface. Let's consider these two activities and how testing relates to them.

Behavior

When I write simple utility functions, like chewing on some binary or data structure and spitting out a result, I will certainly write tests first because I will be able to express the complete intended behavior of the function with these tests.

Unfortunately, most of the functions I write are not that trivial: they interact with functions in other modules in non-obvious manners (asynchronously) and support different failure scenarios. Following a common Erlang idiom, these functions often end up replying with a simple ok: such a result is not enough to drive the development of the function (else fun() -> ok end would be the only function to write to be done). In fact, testing this kind of function first implies expressing with mock expectations all the interactions that will happen when calling the top function. That's MDD (Mock Driven Development) and it's only a letter away from making me MAD. Sorry but writing mocks first makes me nauseous.
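To show what such a test looks like, here is a small sketch using Python's standard unittest.mock rather than an Erlang mocking library. The register_user function and its store and notifier collaborators are hypothetical.

```python
from unittest.mock import Mock


def register_user(name, store, notifier):
    """Function under test: all it returns is "ok", the work is in the interactions."""
    user_id = store.insert({"name": name})
    notifier.publish("user.created", user_id)
    return "ok"


store = Mock()
store.insert.return_value = 42
notifier = Mock()

result = register_user("alice", store, notifier)

# The return value says almost nothing about the behavior...
assert result == "ok"
# ...so the test has to spell out every expected interaction.
store.insert.assert_called_once_with({"name": "alice"})
notifier.publish.assert_called_once_with("user.created", 42)
```

Written before the function exists, these expectations amount to designing the whole call graph upfront, which is exactly the part I prefer to discover while coding.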

My approach to developing and testing complex functions is, to me, more palatable as it leads to a faster gratification: I start by creating an empty function. Then I fill it with a blueprint of the main interactions I am envisioning, expressed as comments. Afterwards, I reify this blueprint by turning the few comments in the original function into a cascade of smaller functions. At this point, I fire up the application and manually exercise the new function: this is when the fun begins as I see this new code coming to life, finding implementation bugs and fixing potential oversights. After being gratified with running code, I then proceed to unit test it thoroughly, exploring each failure scenario with mocks and using a code coverage tool to ensure I haven't forgotten any execution branch in my tests.

This said, there is another behavior-related circumstance under which I will write tests first: when the implemented behavior is proven wrong. In that case, writing tests that make the problem visible before fixing it is the best approach to debugging as it deals with the problem of bad days and lurking regressions.

Public Interface

Writing usable modules implies designing interfaces that are convenient to use. Discussing good API design is way beyond the point here. The point is: could writing tests first be a good guide for creating good interfaces? The immediate answer is yes, as eating your own dog food first makes you more inclined to cook it into the most palatable form possible (anyone who has had to eat dog food, say while enduring hazing, knows this is a parable).

In my practice, I have found things to be a little different, again for less than trivial functions, which unfortunately compose most of a complex production system. For this class of functions, I have found that the context of a unit test is seldom enough to fairly represent the actual context where the functions will be used. And consequently, the capacity to infer a well-designed interface based on these tests first and alone is not enough. Indeed, a unit test context is not reality: look at all the mocks in it, don't they make the whole set look like a movie stage? Do you think it's air you're breathing?

When creating non-trivial public functions, I've found great help in going through a serious amount of code reading in the different places where it is envisioned these functions will be used. Reading a lot of code before writing a little of it is commonplace in our industry: while going through the reading phase, you're actually loading all sorts of contextual information in your short term memory. Armed with such a mental model, it becomes possible to design new moving parts that will naturally fit in this edifice. So I guess that practice would be RDD (reading driven development).

Daring to conclude?

I find it hard to conclude anything from the dichotomy between my practices and what TDD proponents advocate. I consider myself a well-rounded software professional producing code of reasonably good quality: unless I'm completely misguided about myself, I think the conclusion is that it's possible to write solid production code without doing it in a test-driven fashion. If you have the discipline to write tests, you can afford not to be driven by them.

Friday, May 14, 2010

Just Read: Zabbix 1.8 Network Monitoring

Since Zabbix 1.8 came out, I have been wanting to upgrade just for the sake of getting the new and improved AJAXy front-end. Indeed, the Achilles' heel of the previous versions of this otherwise very solid and capable monitoring platform was the poorly responsive GUI. But I kept pushing the upgrade to a later date.

When the good folks at Packt Publishing offered me a peek at their brand new Zabbix book, my procrastination was over. Equipped with such complete and up-to-date reference material, I had no reason not to take the plunge and upgrade.

This 400+ page book is not only welcome as a supporting resource when upgrading, it is also a consummate reference guide that was much needed by all Zabbix users. I've found the book to be easy to read, as it is loaded with screenshots, but also one step beyond a pure user guide. Indeed, the author covers general subjects about application monitoring: for example, the section on SNMP is actually a very good introduction to this protocol, with tons of hands-on examples to guide you through the learning path.

On the down side of things, as is often the case with technical books, I have found the index to be wanting (it's a little short and sometimes disappointing). This is not a big deal though because, in order to make the most of this comprehensive book, it's a good idea to get the eBook version and use full text search to reach the information needed.

Whether you're using Zabbix and want to deepen your skills or want to learn about monitoring in practice, this book has you covered. And if you don't want to take my word on this, download this free chapter and see for yourself!

Friday, April 30, 2010

Grafting Mule Endpoints

Note: The following code samples are applicable to Mule 2.

In Mule ESB, outbound dispatching to a destination whose address is known at runtime only is a pretty trivial endeavor. A less frequent practice consists in programmatically defining inbound service endpoints.

I recently had to do such a thing for a little side project I'm running where Mule is used as a frontal bus and load throttler in front of R nodes exposed over RMI. The goal was to have a non-fixed number of file inbound endpoints defined in a simple properties file and declare them on a particular service during the initialization sequence of Mule.

As an integration framework, Mule ESB exposes all its moving parts and lets you configure them easily with its Spring-powered XML DSL: that's all we need to achieve the above goal.

Let's first look at the resulting service configuration:

As you can see the inbound router doesn't have any endpoint configured on it. This is where we will programmatically graft the file endpoints configured in an external properties file.

Before digging into the code used for this grafting operation, let's look at how the grafter itself is configured:

Unsurprisingly, we use a Spring configured POJO to perform the endpoints generation. Notice how the service and the file connector are referenced: instead of using names I'm directly referencing Mule configuration elements. Because Spring is used consistently behind the scenes, this kind of cross referencing is possible and the key to many advanced tricks!

Now take a deep breath and take a look at the code in charge of grafting the endpoints to the target service:

The important things to pay attention to are the following:
  • The class implements MuleContextAware in order to receive an instance of the MuleContext, which is the key to the gate of Mule's innermost parts. Some might consider fetching this instance from the connector object that also gets injected in this class: I personally find this less desirable for design reasons that I'll let Demeter explain.
  • The endpoint is bound to the desired connector by passing its name in the URI used to create it. This allows picking up the right connector, which is compulsory for any Mule configuration with more than one instance of a particular connector (file connectors in this case).
  • Endpoint specific configuration parameters, like moveToDirectory, are configured as extra URI parameters. You can also add other parameters, as key/value pairs: they will be automatically added to the message properties dispatched from this endpoint.
And voila, though you may never have to do this kind of thing in your Mule ESB projects, you've gained some deeper experience into what a reasonably skilled gardener can do with this powerful platform.

Wednesday, April 28, 2010

Just Read: 97 Things Every Programmer Should Know

As a collection of two-page essays on good software practices, the book offers a pretty heterogeneous reading experience. Despite that, the book is a pleasant and quick read, which covers all aspects of software development, from coding to testing and from technical to human-related concerns.

If you're familiar with the practices that the Agile, XP or Software Craftsmanship movements are putting forward, you'll find that you already knew and agreed with most of the book. In that case, the real value of this book will come from the few essays you find yourself questioning or disagreeing with, as you will have to introspect and decide if your disagreement is founded or based on prejudices.

In conclusion, I think this book will often be found in the "must read" list of books that teams provide to their junior programmers.

Wednesday, April 14, 2010

Just Read: Coders at Work

As an interview-based book, Coders at Work does a pretty good job at exploring the minds, memories and practices of an impressive bunch of software old timers.

To me, the main downside of this book is that it is, with a few exceptions, mainly focused on a pretty homogeneous group of people, i.e. US-based coders who started on PDP-*.

The book could have used a little more diversity because its main value lies in the analysis that we, the readers, will do while reading about the lives of these arch-coders. More diversity would have made the commonality between top coders more salient, while in the book, commonalities feel like they occur simply because most of these people worked during the same period of time on the same type of machines.

Besides that, it's definitely worth the read.

Thursday, February 25, 2010

The Holy Grail Of Persistence?

One of the very first CTO-grade decisions I had to make in the making of Snoget was to pick what would become our main transactional persistence engine. Since we're using Erlang exclusively for our production servers, the solution seemed easy: use Mnesia. But I settled for PostgreSQL.

At this point, anyone who's been dealing with O/R mapping (like Ted Neward who said: "Object/relational mapping is the Vietnam of Computer Science"), should cry foul: Mnesia would offer me persistence without any impedance mismatch with the application runtime environment and I preferred a SQL database to it? Actually, to someone who has used an O/R mapper before and who switched to Erlang, discovering Mnesia for the first time is a sheer heavenly moment similar to that:

The Holy Grail Of Persistence

Though Mnesia is very clearly not presented as a replacement for general purpose RDBMSes, one cannot avoid seriously considering it, just because there is such a low cost in moving data to and from an Erlang application.

As a developer, I already had my share of joys and pains from working with non-standard persistence engines (like Tamino and X-Hive). I also learned from others who did the same, at a much greater scale than me, and who shared their experience about it. So it is with great circumspection that I approached the decision of using a niche database engine instead of a mainstream one.

That being said, here are the four key decision points that made me favor PostgreSQL:

  • Schema Migration - For a startup, it's critical to be able to evolve a database schema with the least friction possible as features are often in a state of flux.

    Using a standard DB like PostgreSQL allowed us to leverage Ruby's ActiveRecord Migration, which is not only handy for migrating forward (as you do in production) but also backwards (as you sometimes have to do in development). Though Mnesia record evolution is possible, the fact that data migration concerns permeate into the application code is very unpleasant. Going schema free was a tempting option but would not have come close to the flexibility ActiveRecord and PostgreSQL gave us.

  • Supporting Resources - Being able to solve problems quickly is essential for a startup: for everything that is not your core business, you usually rely a lot on the information available out there.

    PostgreSQL has an extensive body of knowledge available online and in print. When things go haywire or in case of doubt, you're pretty much guaranteed that a Google search will bring you at least a couple of pages where people asked the exact same question and got answers for them. With Mnesia, the amount of available information is way reduced, simply because it's still very much a niche database.

  • Standard Connectivity - When you're focused on building something new, the last thing you want is wasting time in re-inventing the wheel: interoperable building blocks are key.

    Using a standard database like PostgreSQL gave us immediate access to tools like Pentaho's Data Integration, which we use to massage data. Though we could have built an army of supporting tools to perform the same on Mnesia, it's always better to use something that's already there. It has also allowed us to fully leverage Ruby On Rails to build an awesome back office in no time. Though there are some Ruby-Erlang bridges out there, none gives you all the RAD features you get when plugging Rails to a standard database.

  • Operational Simplicity - In a startup, there's no DBA to nurse your database engine: you have to deal with it so it better be simple to operate.

    Installing, upgrading, backing up and restoring PostgreSQL databases are all well defined operations, supported by a wealth of tools. The security model is straightforward too. And there are plenty of options for monitoring what's happening under the hood and for analyzing and tuning performance. I have no doubt all this is possible with Mnesia, but in a less familiar and straightforward manner.
Of course, there is a downside in using PostgreSQL with Erlang, and a pretty big one: there is no official driver for it so you're fully subject to the talent of the developer whose driver you'll be using. For us, it quickly turned out that the driver we started with was the Achilles' heel of our application and we had to switch to another implementation, which turned out to be very solid. The switch was painful because there is no such thing as edbc, i.e. a standard for database connectors in Erlang. If you switch drivers, you get a new API!

At this point, some pundits must be fuming and asking why SQL? What about NoSQL? Partially for the same reasons quoted above. But more importantly, we're not locked into PostgreSQL: we mainly rely on this database engine for its transactional capacities, not for its relational ones. If the need arises, the way our application is architected would allow us to swap in another persistence engine, provided it's transactional, one functional domain at a time and without too much pain.

Finally, if you wonder if I picked PostgreSQL because I was familiar with this database, the answer is that I had never used it before. But nothing looks more like an RDBMS than another RDBMS. Granted they don't shine like the Holy Grail, but still they'll happily power your software house.

Wednesday, January 06, 2010

Monitoring RabbitMQ with Zabbix

If you use RabbitMQ as your message oriented middleware and Zabbix as your monitoring and graphing tool, you're probably wondering how to monitor the former with the latter.

Here is the Zabbix Agent configuration I use to keep track of the number of messages pending delivery and the total number of queues (this second parameter may not make sense for you if you don't create a lot of dynamic queues):

As you can see, these user parameters are parameterized: they take a single parameter being the virtual host path that you want to monitor. Note also that the zabbix group must be added to the non-password sudoers for rabbitmqctl.
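For illustration only, here is a Python sketch of how the two measures mentioned above (pending messages and number of queues) can be derived from a queue listing. This is not the agent configuration referred to in this post. The summarize_queues helper is hypothetical, and the sample text assumes tab-separated "name, messages" rows surrounded by banner lines, in the style of rabbitmqctl list_queues output: check what your version actually prints.

```python
def summarize_queues(listing):
    """Return (queue_count, pending_messages) from list_queues style text."""
    queue_count = 0
    pending = 0
    for line in listing.splitlines():
        parts = line.split("\t")
        # Keep only "name<TAB>count" rows; banner and footer lines are skipped.
        if len(parts) == 2 and parts[1].strip().isdigit():
            queue_count += 1
            pending += int(parts[1])
    return queue_count, pending


# Assumed sample output for one virtual host.
sample = "Listing queues ...\norders\t3\ninvoices\t0\n...done.\n"
queue_count, pending_messages = summarize_queues(sample)
```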

With these parameters in place, you'll be able to build graphs and set alarms for your favorite RabbitMQ virtual hosts!

UPDATE 10-FEB-2010: Alexis Richardson has been kind enough to point towards an SNMP plug-in for RabbitMQ that has been very recently released on GitHub. I have added a few features to it, so be sure to check my fork too.

UPDATE 04-MAR-2010: I'm now using the SNMP plug-in for RabbitMQ in production instead of the above solution, as the plug-in is way more efficient. The above would then only be of use when SNMP is not an option for you.