Software is awful

Posted by David on Dec 5th, 2008

I used the Wordpress-native XML format to import the nanoblogger data.  WXR is basically just RSS with some extra tags, nanoblogger can create RSS on its own, and for the extra tags I thought that Python would be a good idea.  Python is nice in that it lets you write something quick and sloppy without actually looking too sloppy (I’m looking at you, perl), and it’s supposed to be to handle XML pretty well.  Maybe it can, but I certainly haven’t figured out how.

There are basically two ways to parse XML: the SAX method, a low-level, procedural technique that requres a maze of callback functions to examine the document as it’s parsed, and DOM, an object-oriented method that works with a completely parsed document.  Along with the XML format itself, the W3C also created the Document Object Model, a standard for accessing and manipulating a parsed XML tree, and like most W3C standards, DOM is awful.  It caters to the lowest common denominator of languages (C), and most XML parsers try to implement DOM with standards-compliance in mind, turning what should be a high-level language into an awful, procedural mess.

XML’s verbosity becomes even more boggling once parsed.  For an example, let’s take a look at a simple XML document.

<?xml version="1.0"?>
<root>
   <item>The text you actually want with maybe some <![CDATA[cdata segments]]> in it</item>
</root>

Let’s say you’re starting at the root node of the document, <root>.  What you’d want is a way to get to the text in the <item> child of <root>.  The CDATA part doesn’t matter—it has the same semantics as plain text; CDATA just changes the quoting rules, and the content of <item> is in effect everything that’s there with the <![CDATA[]]> stripped out.  But that’s not what you get.  In DOM terms, <root> contains three nodes: a text node with the linebreak and spaces between <root> and <item>, the actual <item> element, and another text node with the last linebreak before the closing </root>.  <item> also contains three elements: the text leading up to the CDATA section, the CDATA node, and then the text after it.  Given a strict DOM implementation in Python, the most python-y way that easily comes to mind for getting to the text would be something like:

reduce(lambda x, y: x + y.nodeValue, [''] + doc.documentElement.childNodes[1].childNodes)

That, of course, depends on the particulars of all that whitespace we don’t care about.  Now suppose <doc> contains several <item> elements, and perhaps some elements of other types.  You might try to make yourself a list of <item>s using something like

[node for node in doc.documentElement.childNodes if node.nodeName == 'item']

and even now we’re getting sloppy.  nodeName isn’t exactly the same as tagName; nodeName might pick up an unwisely named processing instruction, so we really ought to add a check that node is an Element, and we haven’t even started looking at namespaces.  Xpath offers a query language for getting at particular nodes with particular names and properties, but xpath will just return a NodeList object and leave you back at the beginning as far as getting to the content.

If you don’t see anything wrong with this article so far, you might want to stop reading now.  I am mad at you.

Python comes with a DOM implementation, xml.dom.minidom, for DOM Level 1, and it includes a specification for DOM Level 2—which is basically the same thing as far as node selection—that others can implement.  Pyxml provides DOM Level 2, and both it and minidom are fairly faithful implementations of the W3C standards.  XML in these systems is not an easily manipulated tree, but instead a forest of corner cases and finger-bending verbosity.  This is XML in Python.  Even freaking Javascript handles it better than this.

The best alternative I’ve been able to find is amara, Uche Ogbuji’s attempt to interpret XML in a python-friendly way.  It’s actually pretty nice.  For the document above, I could access the item node (again using “doc” as the parsed document object) using doc.root.item.  For a document with more than one <item>, the same code selects the first <item> node but can also be used as an array or an iterator.  As for the content, the node object implements __str__ sensibly, so just using the in a context that expects a string will provide the text content, CDATA and all.  It just about makes XML make some sense.

Compared to the trials of pyxml or the similarly low-level libxml2 bindings, my problems with amara seem almost trivial.  The first concerns namespaces, an issue that seems doomed to be awful in any implementation.  Google for “xpath default namespace” if you want some fun bedtime reading.  Amara ignores namespaces if you ignore them, which, since you can’t include a colon in a python property, usually works for the best.  The namespace URI is available as a property of the node objects, and, as an added bonus, the amara parser will load the document’s namespace prefixes for use in xpath expressions and serialization.  It also provides a means of specifying a set of namespace prefixes when parsing the document, but I’m not sure where these are actually used.  The extra prefixes seem to be available for xpath, but not for the names used when creating new elements, and serialization will still use whatever was in the original document unless overridden in the serialization function call.  So I guess my complaint here is that the API could stand some better documentation.  And prefixes in element creation would be nice, or at least nicer if it turns out they’re there and I just don’t understand how to use it.

A bigger complaint I have with amara concerns how it handles one of the nastier quirks of Python.  In Python there are two types of strings: the regular kind, and the unicode kind.  Usually this difference isn’t a problem; 'string' and u'string' seem like they would be the same thing, and usually they are.  Python’s idea of objects and types uses a concept known as “duck typing” (if it looks like a duck, and it walks like a duck…), which just means that object types don’t matter as much as the methods they implement.  For example, the str and unicode objects both implement the join() method, so an object of either type can be used in a context that expects join().  The problem with amara is that it requires every string—new element names, attribute names, node and attribute contents—to be a unicode type.  The especially annoying problem with amara is that it doesn’t fail to create nodes using regular strings, but it does fail to serialize nodes using regular strings.

What I really want out of python is about what amara is doing, something that can turn tag names into object names, convert attributes to and from the python dictionary type, and generally hide most of the nastier parts of XML while still exposing enough of it when needed, like the cdataSectionElements parameter in the serializer that I needed in order to make Wordpress not freak out when given unquoted post contents.  But I’d like something that behaves more intuitively for all cases, and, in a language that claims to be pretty alright for XML processing, I’d like XML methods better suited to the language itself built into the standard library.

Leave a Comment




XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>