Saturday, November 13, 2004

Making XML Suck Less

XML sucks. It's axiomatic these days, except to those corporate code-jockeys who happily implement everything their pointy-haired bosses tell them is buzzword compliant. (Many, I'm sure, do it to keep their job; just as many, I feel, probably do it because they believe their own bullshit.)

One of the corollaries of "XML Sucks" is "the great tools are what make XML suck less." And the pro-XML faction firmly believes that XML is what makes it possible to implement those great tools. I hold to another point of view. XML sucks, the tools for processing XML suck, and XML makes it harder to implement those tools. But it is possible to write tools that don't suck, and that would make XML suck less.

There are a few bright points. Frameworks like Nevow make XML generation both elegant and idiot-proof. Tools like uTidylib and BeautifulSoup make it possible to clean up other people's garbage. And some of the standards for XML are actually useful. CSS is simply an awesome way to change the appearance of a rendered document. Relax-NG provides a way to define a valid XML format without the awfulness of a DTD or the brain-numbing pedantry of a schema, and even has a compact form. (An aside: the word "compact" gets applied to a lot of XML-related technologies. It's actually code for "you don't have to use that awful XML crap to write this, you can use a more sensible syntax instead." Examples: XSLTXT, Relax-NG compact.) And then there's XPath, which provides a concise way to get a set of nodes from a document, and is extensible.

Extending XPath

I want to use XPath on a project I'm working on. (If it goes anywhere, I'll blog about that too.) To use XPath, you generally have to provide some extension functions. This is because XPath's set of core functions, while sensible, can't do some basic operations that other standards support. For example, you can interact with a CSS stylesheet by giving a node more than one class, like this:
<div class="important">1. You should turn your car on before attempting to drive it.</div>
<div class="dangerous">2. If you drop a lit match in the gas tank, bad things will happen.</div>
<div class="important dangerous">3. Don't drink and drive.</div>
<div class="dangerous important">4. Use tire chains when driving on an icy road.</div>
<div class="not-important">5. Your glove compartment can be used to store maps.</div>

This applies both the important and dangerous styles to nodes (3) and (4), whatever that means. In CSS, it probably specifies an appearance, but other applications acting on the same file may also want to get at those classes. XPath's core functions offer no direct way to select a node by one class among several. Consider how we might select the nodes with class 'important'.
//*[@class='important'] # only matches (1), not (3) or (4)
//*[starts-with(@class, "important")] # only matches (1) and (3)
//*[contains(@class, "important")] # matches (1), (3) and (4) .. oops, and (5) too.
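
To see why each attempt falls short, the three predicates can be mimicked with plain Python string operations on the five class values. This is only a sketch: the classes dict and the variable names are made up for illustration, and it assumes XPath's argument order for starts-with and contains (the string to search first, the substring second).

```python
# Class attribute values of the five example divs, keyed by item number.
classes = {
    1: "important",
    2: "dangerous",
    3: "important dangerous",
    4: "dangerous important",
    5: "not-important",
}

# @class = 'important'
exact = sorted(n for n, c in classes.items() if c == "important")
# starts-with(@class, 'important')
prefix = sorted(n for n, c in classes.items() if c.startswith("important"))
# contains(@class, 'important')
substring = sorted(n for n, c in classes.items() if "important" in c)
# What we actually want: 'important' as one whitespace-separated token.
wanted = sorted(n for n, c in classes.items() if "important" in c.split())

print(exact)      # [1]
print(prefix)     # [1, 3]
print(substring)  # [1, 3, 4, 5]
print(wanted)     # [1, 3, 4]
```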

After trying libxml2 and having it crash the Python interpreter on me (this is the second time I've given it a try; there won't be a third) I installed pyxml 0.8.3. First thing I had to do was figure out how to implement an extension function for using regular expressions, such that:
//*[func("\bimportant\b", @class)]
returns the nodes I want, by regular expression selection.
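
One caveat worth checking before relying on that pattern: in Python's re module, \b counts the position next to a hyphen as a word boundary, so \bimportant\b also matches not-important. The sketch below shows this, along with one possible alternative that anchors on whitespace instead. The TOKEN pattern and the sample list are illustrative assumptions, not part of the original function.

```python
import re

samples = ["important", "important dangerous", "dangerous important",
           "not-important", "unimportant"]

# Word-boundary version, as used in the XPath expression above.
WORD = re.compile(r"\bimportant\b")
# Hypothetical stricter version: the token must be delimited by
# whitespace or by the ends of the string.
TOKEN = re.compile(r"(?:^|\s)important(?:\s|$)")

word_hits = [s for s in samples if WORD.search(s)]
token_hits = [s for s in samples if TOKEN.search(s)]

print(word_hits)   # includes 'not-important'
print(token_hits)  # only the three values with an 'important' token
```

Whether the hyphen case matters depends on the class names in your documents; for the five example nodes it does.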

First I learned that namespaces in XPath are, quite naturally, XML namespaces. Therefore, to call func I needed to define a namespace, analogous to a Python module, for it to live in. I chose an arbitrary URL at a hypothetical developer.berlios.de website for the hypothetical project I'm working on: "http://bypath.berlios.de/2004/11/bypath". This string is itself the namespace; when you want to refer to the namespace, you use an alias, called a prefix.

Then I wrote the first version of my extension function:
def simpleSearchRe(ctx, expr, input):
    return re.search(expr, input) is not None

Some things to note. simpleSearchRe takes three arguments, not two. The first argument is the XPath evaluation context object, which pyxml always passes to extension functions. simpleSearchRe returns True or False, not a string or a node or a nodelist or something else; so it can be used to filter nodes in an XPath expression in exactly the manner I demonstrated above.

Once you have an extension function and a namespace, making the actual XPath binding is simplicity itself. The code:
xpath.g_extFunctions.update({...})
The dict you pass maps a (namespace, function-name) tuple to the actual function. Example:
xpath.g_extFunctions.update({("http://bypath.berlios.de/2004/11/bypath", "simple-search-re"): simpleSearchRe})
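
Roughly, a call like by:simple-search-re gets resolved in two steps: the prefix is looked up to find the namespace string, and the (namespace, name) pair is then looked up in the function table. The following is a toy model of that lookup in plain Python, not pyxml's actual code; resolve, prefixes and ext_functions are hypothetical names.

```python
import re

BYPATH_NAMESPACE = "http://bypath.berlios.de/2004/11/bypath"

def simpleSearchRe(ctx, expr, input):
    return re.search(expr, input) is not None

# Stand-in for xpath.g_extFunctions: (namespace, local name) -> callable.
ext_functions = {(BYPATH_NAMESPACE, "simple-search-re"): simpleSearchRe}

# Stand-in for the prefix-to-namespace table (an assumption about
# where the alias lives; here it is just a dict).
prefixes = {"by": BYPATH_NAMESPACE}

def resolve(qname):
    """Turn a prefixed name such as 'by:simple-search-re' into a function."""
    prefix, local = qname.split(":", 1)
    return ext_functions[(prefixes[prefix], local)]

func = resolve("by:simple-search-re")
print(func(None, r"\bimportant\b", "dangerous important"))  # True
```

The prefix itself carries no meaning: any other alias mapped to the same namespace string would resolve to the same function.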

Unfortunately the above implementation of simpleSearchRe is too naïve for my desired use. For example, this works:

//*[by:simple-search-re('\bimportant\b', string(@class))]

but my original example does not. This raises a RuntimeError:
//*[by:simple-search-re('\bimportant\b', @class)]

Since developers familiar with XPath will expect both to work, I had to dig deep to find out what was happening. Finally I learned that @class becomes a list of nodes with one item, and my function tries to treat it like a string. There's a big gotcha: a bug in pyxml makes it appear that my code isn't even being called, because simpleSearchRe does not appear in the traceback call stack anywhere. In fact what's happening is my code is raising a simple TypeError, and pyxml then eats that error and issues its own, in a different part of the code.
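
The underlying failure is easy to reproduce without pyxml. In this sketch a one-item Python list stands in for the node list that @class evaluates to (an assumption for illustration; the real thing holds DOM attribute nodes), and re.search rejects it with a TypeError.

```python
import re

def simpleSearchRe(ctx, expr, input):
    return re.search(expr, input) is not None

# A one-item list standing in for the node list produced by @class.
fake_nodelist = ["important dangerous"]

try:
    simpleSearchRe(None, r"\bimportant\b", fake_nodelist)
    outcome = "no error"
except TypeError as exc:
    # re.search only accepts strings, so the list is refused.
    outcome = "TypeError"
    print("TypeError: %s" % exc)

print(outcome)  # TypeError
```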

This revised version worked:
def simpleSearchRe(ctx, expr, input):
    # convert a node, nodelist or a string to a string
    input = xpath.Conversions.StringValue(input)
    return re.search(expr, input) is not None
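
For readers without pyxml at hand, here is a rough model of what that conversion step has to do, following the XPath 1.0 rule that a node-set converts to the string value of its first node, or to an empty string if the set is empty. string_value is a hypothetical stand-in written for this sketch, not pyxml's implementation, and it only handles strings, lists and attribute nodes.

```python
from xml.dom import minidom
import re

def string_value(obj):
    """Hypothetical stand-in for xpath.Conversions.StringValue."""
    if isinstance(obj, str):
        return obj
    if isinstance(obj, (list, tuple)):
        # A node-set converts to the string value of its first node,
        # or to '' when it is empty.
        return string_value(obj[0]) if obj else ""
    # Assume a DOM attribute node; its string value is the attribute value.
    return obj.value

def simpleSearchRe(ctx, expr, input):
    return re.search(expr, string_value(input)) is not None

doc = minidom.parseString('<x class="b a c"/>')
attr = doc.documentElement.getAttributeNode("class")

print(simpleSearchRe(None, r"\ba\b", "b a c"))  # True: plain string
print(simpleSearchRe(None, r"\ba\b", [attr]))   # True: one-item node list
print(simpleSearchRe(None, r"\ba\b", []))       # False: empty node list
```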

The full module, demonstrating how one extends pyxml's xpath with new functions:
from xml.dom import minidom

from xml import xpath
import re

# define a namespace
BYPATH_NAMESPACE = 'http://bypath.berlios.de/2004/11/bypath'

# define a function capable of coercing its arguments to strings and
# operating on them
def simpleSearchRe(ctx, expr, input):
    # convert a node or a string to a string
    input = xpath.Conversions.StringValue(input)
    return re.search(expr, input) is not None

# add the function to the global list of extension functions, bound to
# an xpath name in an xml namespace
xpath.g_extFunctions.update({(BYPATH_NAMESPACE, 'simple-search-re'):
                             simpleSearchRe})

# test doc
doc = minidom.parseString('<y class="aa bb"><x class="b a c"/></y>')

# create a context which knows about our namespace
ctx = xpath.CreateContext(doc)
ctx.setNamespaces({'by':BYPATH_NAMESPACE})

xeval = lambda expr: xpath.Evaluate(expr, context=ctx)

# tests
print xeval('//*')
print xeval(r'//*[by:simple-search-re("\ba\b", string(@class))]')
print xeval(r'//*[by:simple-search-re("\ba\b", @class)]')
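
As a cross-check on what the last two tests should select, the same filtering can be approximated with minidom and re alone, no XPath involved. This is a sketch of the expected result, not a replacement for the module above; it relies on getElementsByTagName("*") returning every element in document order.

```python
from xml.dom import minidom
import re

doc = minidom.parseString('<y class="aa bb"><x class="b a c"/></y>')

# Rough equivalent of '//*': every element in the document.
elements = doc.getElementsByTagName("*")
all_names = [e.tagName for e in elements]

# Rough equivalent of the regex filter on @class.
matched = [e.tagName for e in elements
           if re.search(r"\ba\b", e.getAttribute("class"))]

print(all_names)  # ['y', 'x']
print(matched)    # ['x']
```

Only x should come back: "b a c" contains a as a separate word, while "aa bb" does not.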
