Sunday, February 06, 2005

Efficient XML processing in Python

In a recent article, "Decomposition, Process, Recomposition", Uche Ogbuji talks about strategies for processing large XML documents in Python. This led me to recall my own experience with processing multi-megabyte XML documents.

In general, there are at least three popular strategies for dealing with XML in Python: SAX-based, DOM-based and, er, shall I say, pythonic. SAX is a fast and memory-light approach, but it shifts the burden of keeping the processing context onto the programmer. DOM is a cross-language, standard, verbose and memory-hungry strategy that scales poorly to large documents. By pythonic I mean a range of libraries, like gnosis.xml.objectify, ElementTree and xmltramp, which share a common aim: to provide a nice, Python-friendly API.
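To make the "burden of keeping a processing context" concrete, here is a minimal sketch using the standard library's xml.sax module. The sample document, the TitleCollector class and the collect_titles helper are all made up for illustration; the point is only that the handler has to maintain its own stack of open elements to know where it is.

```python
import xml.sax

# Hypothetical sample: we want <title> only when it sits directly inside <book>.
SAMPLE = (
    b"<library>"
    b"<book><title>Dune</title></book>"
    b"<shelf><title>Top</title></shelf>"
    b"<book><title>Emma</title></book>"
    b"</library>"
)


class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of <title> elements that are children of <book>."""

    def __init__(self):
        super().__init__()
        self.path = []     # stack of open element names: context SAX does not keep for us
        self.buffer = []   # text chunks of the title currently being read
        self.titles = []

    def _in_book_title(self):
        return self.path[-2:] == ["book", "title"]

    def startElement(self, name, attrs):
        self.path.append(name)
        if self._in_book_title():
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_book_title():
            self.buffer.append(content)

    def endElement(self, name):
        if self._in_book_title():
            self.titles.append("".join(self.buffer))
        self.path.pop()


def collect_titles(data):
    handler = TitleCollector()
    xml.sax.parseString(data, handler)
    return handler.titles
```

With a tree-based library the same query is a one-line lookup, which is roughly the trade-off described above: SAX stays small in memory, but the bookkeeping moves into your code.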

I used the ElementTree library (and, recently, its new C-based reincarnation). The task was as follows: given a large (~10 MB) XML document with a complex structure, the program had to significantly "extend" it with new information and write the result to a new file. The new data was partly specified by external sources and partly computed from the document itself, according to a bunch of arcane rules. The computation involved several traversals of the document structure to gather the needed information.

The program (and transformation) itself was just a single link in a lengthy chain of transformations. There were other parts, written as XSLT procedures and DOM-based Javascripts that used to perform other

The decompose, process, recompose approach outlined in the article was realized by ElementTree itself; my task was only to provide the necessary processing steps. In contrast with SAX, ElementTree builds an accurate in-memory representation of the entire XML document, but unlike DOM, its memory footprint grows much more slowly with the size of the document, thanks to an efficient representation. This gives the best of both worlds: a versatile and convenient data model with good scalability for large documents.
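A rough sketch of that parse, extend and write cycle might look like the following. It is not the original program: the order/item document, the extend_orders function and the "total" element are invented for illustration, and the import path is the one from today's standard library (xml.etree.ElementTree) rather than the separate package available when this was written.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input document, standing in for the real multi-megabyte file.
SAMPLE = (
    "<orders>"
    "<order><item price='2.5'/><item price='4'/></order>"
    "<order><item price='1'/></order>"
    "</orders>"
)


def extend_orders(source, target):
    """Parse source, add a computed <total> to every order, write to target."""
    tree = ET.parse(source)          # decompose: the library builds the tree
    root = tree.getroot()

    # process: findall() returns a list, so adding children while looping is safe
    for order in root.findall("order"):
        total = sum(float(item.get("price")) for item in order.findall("item"))
        summary = ET.SubElement(order, "total")
        summary.text = f"{total:.2f}"

    # recompose: serialize the extended tree; "unicode" writes text, not bytes
    tree.write(target, encoding="unicode")


def demo():
    out = io.StringIO()
    extend_orders(io.StringIO(SAMPLE), out)
    return out.getvalue()
```

In real use, source and target would be file names or open files; the in-memory streams are only there to keep the example self-contained.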

My only major complaint was the lack of XPath support. While ElementTree does offer some very limited support for basic XPath expressions, it is a long way from a full-blown implementation. Thus I was forced to write a bunch of custom finders where a single XPath expression would have done the trick. Luckily, they are pretty straightforward to write.
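For a flavour of what such a finder can look like, here is a small sketch. The helpers find_where, parent_map and expensive_order_ids are hypothetical names, not part of ElementTree, and the sample document is made up. Together they stand in for an XPath query with a numeric comparison and a step up to the parent, something like //item[@price > 3]/../@id.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = (
    "<orders>"
    "<order id='a'><item price='2.5'/><item price='4'/></order>"
    "<order id='b'><item price='1'/></order>"
    "</orders>"
)


def find_where(root, tag, predicate):
    """Yield every descendant with the given tag for which predicate(elem) is true."""
    for elem in root.iter(tag):
        if predicate(elem):
            yield elem


def parent_map(root):
    """Map each element to its parent; elements do not store parent links themselves."""
    return {child: parent for parent in root.iter() for child in parent}


def expensive_order_ids(xml_text, threshold):
    """Return the ids of orders that contain an item priced above threshold."""
    root = ET.fromstring(xml_text)
    parents = parent_map(root)
    matches = find_where(root, "item", lambda e: float(e.get("price")) > threshold)
    return [parents[item].get("id") for item in matches]
```

Calling expensive_order_ids(SAMPLE, 3.0) picks out the order with id "a". The predicate is an ordinary Python callable, so the arcane rules can live in plain functions instead of being squeezed into a path expression.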

