Saturday, March 29, 2008

Dumb XML

Beautiful Code is a book that cannot leave one indifferent. While reading it, I was really annoyed by a statement made by one of the authors:
What I always tell people is that XML documents are just big text strings. Therefore, it's usually easier to just write one out using StringBuffer rather than trying to build a DOM (Document Object Model) or using a special XML generator library.

I fully disagree with this position, because I have seen time and again the adverse results of such a simplistic approach to XML generation:
  • Malformed XML: basic string construction does not handle the escaping of general entities, the validity of element names or the correct balancing of tags. I once had to deal with an XML document that was so broken I initially thought it was SGML. It was neither, and I ended up using regular expressions instead of SAX to parse it.
  • Invalid XML: applications that generate XML should be polite enough to validate the data they produce before sharing it with other applications. Back when DTDs were current, I made a practice of adding the external declaration only if the document had been tested to be valid, like a stamp of approval. Of course, I also had to deal with an application that never considered it important to output XML that complied with its own schema!
  • Bad encoding: I realize that many developers live and work in places where 7-bit ASCII is enough to represent all the characters they need. But the rest of the world cares about accents and other language particularities. Hence again: basic string building gives no guarantee in terms of the correct representation of Unicode characters.
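To make the first and third points concrete, here is a minimal Java sketch that contrasts naive string building with the standard StAX writer (javax.xml.stream). The class name XmlWriting, the customer element and the sample value are made up for illustration; only the StAX calls are real JDK API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class, not taken from the book or from any library.
class XmlWriting {

    // Naive approach: the value is pasted into the markup verbatim,
    // so '&' and '<' in the data end up being read as markup.
    static String naive(String name) {
        return new StringBuilder()
                .append("<customer>").append(name).append("</customer>")
                .toString();
    }

    // StAX approach: the writer escapes character data, tracks open
    // elements and encodes the output bytes in the declared encoding.
    static byte[] withStax(String name) throws XMLStreamException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        XMLStreamWriter writer = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(out, "UTF-8");
        writer.writeStartDocument("UTF-8", "1.0");
        writer.writeStartElement("customer");
        writer.writeCharacters(name);
        writer.writeEndElement();
        writer.writeEndDocument();
        writer.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // \u00e9 is an e with an acute accent.
        String name = "Dupont & Fils <Orl\u00e9ans>";
        System.out.println(naive(name));
        System.out.println(new String(withStax(name), StandardCharsets.UTF_8));
    }
}
```

The naive output is rejected by any conforming parser because of the bare ampersand, while the StAX output parses back to the original string. Note that the naive version also leaves the byte encoding to whoever eventually writes the string out, which is exactly where accented characters tend to get mangled.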

Of course, for trivial XML blocks that will never contain any special characters nor vary much in form, using a StringBuilder (not StringBuffer, by the way: most of the time the synchronized version is unnecessary) is more than enough. You can also use a helper class to encode all strings and escape all entities.
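Such a helper can be very small. The sketch below is one possible shape, with XmlEscaper and escape being hypothetical names; it only replaces the five predefined XML entities, and it assumes the input contains no characters that are illegal in XML 1.0 (most control characters, for instance), which it does not filter.

```java
// Hypothetical helper for use with StringBuilder-based XML output.
final class XmlEscaper {

    private XmlEscaper() {
        // Static utility, not meant to be instantiated.
    }

    // Replaces the five predefined XML entities so that the value is safe
    // in both element content and quoted attribute values.
    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Every value then has to go through the helper, as in sb.append("<name>").append(XmlEscaper.escape(name)).append("</name>"). Forgetting it once is enough to produce a malformed document, and the helper does nothing about tag balancing or encoding.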

But if you go further than the trivial use cases or if the data you integrate in your XML document comes from an uncontrolled source (like a database connection or another application layer), use a proper library for building XML.
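As an illustration of what "a proper library" can look like with nothing but the JDK, here is a sketch that builds a document with DOM, serializes it with a Transformer, and validates the result against a schema before it is shared. The ValidatedOutput class, the order and item element names and the inline schema are all invented for the example; a real application would load its own schema file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

// Hypothetical demo class: build with DOM, serialize, validate, then share.
class ValidatedOutput {

    // Made-up schema: an <order> element containing exactly one <item>.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='order'><xs:complexType><xs:sequence>"
        + "<xs:element name='item' type='xs:string'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Builds <order><childName>text</childName></order> as a DOM tree.
    static Document build(String childName, String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element order = doc.createElement("order");
        Element child = doc.createElement(childName);
        child.setTextContent(text); // escaped when the tree is serialized
        order.appendChild(child);
        doc.appendChild(order);
        return doc;
    }

    // Serializes the tree to a string with the default identity transformer.
    static String serialize(Document doc) throws Exception {
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Returns true only if the serialized document complies with XSD.
    static boolean isValid(String xml) throws Exception {
        Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // malformed or schema-invalid
        }
    }
}
```

A caller would serialize, check isValid and only then hand the string over to another application. Compiling the schema on every call keeps the sketch short; in real code the Schema object would be created once and reused.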

XML is simple, but do not dumb it down to simplistic.


zepag said...

Apart from the amazing fact that you own developers ( ;) ), I could not agree more.

In fact XML is a very structured and standard format for data (if not a very readable or concise one).

I'd add that using a proper library allows a proper handling of namespaces with schemas, and that nobody should ignore that.

Sadly XML knowledge is quite low in IT (and I include myself a bit in that ;) ).

One last thing about validation: one great motto is: "validate strictly what you share, and don't do it so much for what you consume".

Cheers ;)

David Dossot said...

I like this motto, it is so true. Indeed, I tend to be very lax when I parse a document from an external source, always assuming the worst, like unexpected namespace changes...

Anonymous said...

You may also want to look at vtd-xml, the latest and most advanced XML processing