Saturday, November 13, 2004

Making XML Suck Less

XML sucks. It's axiomatic these days, except to those corporate code-jockeys who happily implement everything their pointy-haired bosses tell them is buzzword compliant. (Many, I'm sure, do it to keep their job; just as many, I feel, probably do it because they believe their own bullshit.)

One of the corollaries of "XML Sucks" is "the great tools are what make XML suck less." And the pro-XML faction firmly believe that XML is what makes it possible to implement those great tools. I hold to another point of view. XML sucks, the tools for processing XML suck, and XML makes it harder to implement those tools. But it is possible to write tools that don't suck, and that would make XML suck less.

There are a few bright points. Frameworks like Nevow make XML generation both elegant and idiot-proof. Tools like uTidylib and BeautifulSoup make it possible to clean up other people's garbage. And some of the standards for XML are actually useful. CSS is simply an awesome way to change the appearance of a rendered document. Relax-NG provides a way to define a valid XML format without the awfulness of a DTD or the brain-numbing pedantry of a schema, and even has a compact form. (An aside: the word "compact" gets applied to a lot of XML-related technologies. It's actually code for "you don't have to use that awful XML crap to write this, you can use a more sensible syntax instead." Examples: XSLTXT, Relax-NG compact.) And then there's XPath, which provides a concise way to get a set of nodes from a document, and is extensible.

Extending XPath

I want to use XPath on a project I'm working on. (If it goes anywhere, I'll blog about that too.) To use XPath, you generally have to provide some extension functions. This is because XPath's set of core functions, while sensible, can't do some basic operations that other standards support. For example, you can interact with a CSS stylesheet by multiclassing nodes, like this:
<div class="important">1. You should turn your car on before attempting to drive it.</div>
<div class="dangerous">2. If you drop a lit match in the gas tank, bad things will happen.</div>
<div class="important dangerous">3. Don't drink and drive.</div>
<div class="dangerous important">4. Use tire chains when driving on an icy road.</div>
<div class="not-important">5. Your glove compartment can be used to store maps.</div>

This applies both the important and dangerous styles to nodes (3) and (4), whatever that means. In CSS, it probably specifies an appearance, but other applications acting on the same file may also want to get at those classes. XPath can't select a node that uses classes in this way. Consider how we might select the nodes with class 'important'.
//*[@class='important'] # only matches (1), not (3) or (4)
//*[starts-with(@class, "important")] # only matches (1) and (3)
//*[contains(@class, "important")] # matches (1), (3) and (4) .. oops, and (5) too.

After trying libxml2 and having it crash the Python interpreter on me (this is the second time I've given it a try; there won't be a third) I installed pyxml 0.8.3. First thing I had to do was figure out how to implement an extension function for using regular expressions, such that:
//*[func("\bimportant\b", @class)]
returns the nodes I want, by regular expression selection.

First I learned that namespaces in XPath are, quite naturally, XML namespaces. Therefore, to call func I needed to define a namespace, analogous to a Python module, for it to live in. I chose an arbitrary URL at a hypothetical developer.berlios.de website for the hypothetical project I'm working on: "http://bypath.berlios.de/2004/11/bypath". This string is itself the namespace; when you want to refer to the namespace, you use an alias, called a prefix.

Then I wrote the first version of my extension function:
def simpleSearchRe(ctx, expr, input):
    return re.search(expr, input) is not None

Some things to note: simpleSearchRe takes three arguments, not two. The first argument is the XPath evaluation context object, which pyxml always passes to extension functions. simpleSearchRe returns True or False, not a string or a node or a nodelist or something else, so it can be used to filter nodes in an XPath expression in exactly the manner I demonstrated above.

Once you have an extension function and a namespace, making the actual XPath binding is simplicity itself. The code:
xpath.g_extFunctions.update({...})
The dict you pass maps a (namespace, function-name) tuple to the actual function. Example:
xpath.g_extFunctions.update({("http://bypath.berlios.de/2004/11/bypath", "simple-search-re"): simpleSearchRe})

Unfortunately the above implementation of simpleSearchRe is too naïve for my desired use. For example, this works:

//*[by:simple-search-re('\bimportant\b', string(@class))]

but my original example does not. This one raises a RuntimeError:
//*[by:simple-search-re('\bimportant\b', @class)]

Since developers familiar with XPath will expect both to work, I had to dig deep to find out what was happening. Finally I learned that @class becomes a list of nodes with one item, and my function tries to treat it like a string. There's a big gotcha here: a bug in pyxml makes it appear that my code isn't even being called, because simpleSearchRe doesn't appear anywhere in the traceback's call stack. In fact, what's happening is that my code raises a simple TypeError, which pyxml then eats and replaces with its own error from a different part of the code.

This revised version worked:
def simpleSearchRe(ctx, expr, input):
    # convert a node, nodelist or a string to a string
    input = xpath.Conversions.StringValue(input)
    return re.search(expr, input) is not None

The full module, demonstrating how one extends pyxml's xpath with new functions:
from xml.dom import minidom

from xml import xpath
import re

# define a namespace
BYPATH_NAMESPACE = 'http://bypath.berlios.de/2004/11/bypath'

# define a function capable of coercing its arguments to strings and
# operating on them
def simpleSearchRe(ctx, expr, input):
    # convert a node or a string to a string
    input = xpath.Conversions.StringValue(input)
    return re.search(expr, input) is not None

# add the function to the global list of extension functions, bound to
# an xpath name in an xml namespace
xpath.g_extFunctions.update({(BYPATH_NAMESPACE, 'simple-search-re'):
                             simpleSearchRe})

# test doc
doc = minidom.parseString('<y class="aa bb"><x class="b a c"/></y>')

# create a context which knows about our namespace
ctx = xpath.CreateContext(doc)
ctx.setNamespaces({'by':BYPATH_NAMESPACE})

xeval = lambda expr: xpath.Evaluate(expr, context=ctx)

# tests
print xeval('//*')
print xeval(r'//*[by:simple-search-re("\ba\b", string(@class))]')
print xeval(r'//*[by:simple-search-re("\ba\b", @class)]')

Sunday, November 07, 2004

Experiments in Writing an Offline Blog

I did a very strange thing yesterday. I wrote a blog. At least, it might look like a blog if you allowed certain liberties, such as overlooking the fact that most blogs don't force you to upload a static HTML file in order to publish a post. I will try to explain why I did this, but I'm going to tell you up front that the explanation will not be satisfying and largely amounts to "because I could". I will also explain how I did it, and that will be more interesting if you have a problem in the same category as mine.

The Problem

As the author of PBP, I have to keep the front page updated with news items. Until now I've been doing this by editing an HTML file by hand. Well, I hate editing XML by hand, and while HTML at least is supported by editors, they never get the markup quite right. The editor I'm using right now to generate this blog entry is doing ugly things like using font style attributes and making guesses about where my line breaks should go. For prose on my blog, that's acceptable, but I expect a certain precise format for my "professional" projects.

Frustrated with the errors I was constantly making getting the news file just right, I decided to make news entries something much simpler to type. I've been doing a lot of successful experimenting with YAML lately, and although I had previously decided it was more appropriate for "data" formats than "document" formats (whatever you think that means) I figured news entries were brief enough and structured enough that YAML might be appropriate.

I also needed a way to render the YAML to HTML. Well, I already had this, in Nevow. Of course, you can't run your own webserver on BerliOS, and Nevow isn't really appropriate for CGI yet, and in any case I have no need yet for this level of control. I just needed a way to generate the static HTML file that I was at the time generating by hand.

When I was done I wanted a tool I could run on a YAML file, which would plug my news items into a template and give me an HTML document suitable for posting on the web. Then in the future, when I want to edit my news items, I only have to edit a YAML file. As YAML is much more readable and writeable than XML, and need contain only the dynamic part of the page (the news items) instead of all of it, this would be a much less onerous chore. As a bonus, if I made a formatting mistake, I should get a parse error instead of having the problem hidden behind a web browser's sorta-error-correcting rendering engine.

The Implementation: An "Offline Blog"

The implementation for this is available in PBP's svn repository, which you can check out with the command
svn co svn://svn.berlios.de/pbp/trunk pbp
I will refer to parts of this by their location relative to the root of the working copy, so (for example) /doc/pbp.xhtml refers to svn://svn.berlios.de/pbp/trunk/doc/pbp.xhtml in the repository and pbp/doc/pbp.xhtml in the working copy.

Step one was getting Nevow to render a static HTML file. Fortunately this is easy. There are two ways to do it. One is the newly-added Page.renderSynchronously method, which returns the rendered HTML document. The other is to start and stop a reactor and use plain old callbacks to finish rendering. The former method is fine if you are sure you'll never need to execute anything asynchronously while publishing the file (and therefore never need real deferred handling), but I figured I would go ahead and use the reactor in case I ever made a GUI out of this.
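For the curious, a minimal sketch of the synchronous route might look something like this (NewsPage and the output filename are hypothetical stand-ins, and I'm assuming renderSynchronously hands the rendered markup straight back as a string):

from nevow import rend, loaders

class NewsPage(rend.Page):
    # hypothetical stand-in for the real Page subclass
    docFactory = loaders.xmlfile('pbp.xhtml')

# render the page in one shot and dump the result to a static file
html = NewsPage().renderSynchronously()
open('pbp.html', 'w').write(html)

The reactor-based route ends up doing the same write, just from inside a callback once rendering finishes.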

First I designed a template: /doc/pbp.xhtml. I won't go into the details of Nevow templates, and this one is completely unremarkable. Only two interesting things will be plugged in: the latest released version, which appears in the template as <n:slot name='latest' />, and the news items, which appear as <n:invisible render='news' />.

Then I needed some data to put into the template. I had to read up a bit on YAML, but I came up with /doc/news.yml with very little trouble. (Tip: To use >-style folding, you need to indent everything in the block, even paragraph separators. A paragraph separator is actually a line containing only n blank spaces between two paragraphs, where n is the number of spaces the preceding paragraph is indented.) The YAML document contains two items, "release:" which contains the version of the most recent release, and "news:" which is a sequence of all the news items, ever. Each news item in turn is a map of three items, "date:", "title:", and "c:" which stands for content. The important thing to see is how simple it is to add new items to this file, and how extremely readable the file is. And since it's done with YAML, I didn't have to write my own parser for the format, as PyYaml already exists.
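To make that shape concrete, a couple of hypothetical entries in the layout just described might look like this (the version, dates and text are invented; the real data lives in /doc/news.yml):

release: "0.1"
news:
  - date: 2004-11-07
    title: News page is now generated
    c: >-
      The front page is now built from this file instead of
      hand-edited HTML.
  - date: 2004-10-01
    title: An older item
    c: >-
      Older entries just stack up underneath; only the most recent
      ones end up on the generated page.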

To parse this file into native Python data types is just one call:
ydata = list(yaml.loadFile(yamlfile))[0]
yaml.loadFile returns a stream of YAML documents and is designed for streaming data. I turn this generator into a list with list() and take the first (zeroth) item from it, as there is only one YAML document there.
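From there, the pieces described above are plain Python objects, so getting at them hypothetically looks like this (key names as described above):

latest = ydata['release']   # version string of the most recent release
items = ydata['news']       # list of dicts with 'date', 'title' and 'c' keys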

To go with the template I needed a rend.Page subclass, which is news2html.Page in /doc/news2html.py. I pass ydata to the Page, and the news items and release version are extracted in Page.render_news and Page.render_download.

Now there was just one problem left to solve. I wanted to be able to put links in my HTML documents, of the form <a href='...'>blah blah</a>.

A Secret Friend

I considered several possibilities for the problem of HTML linking. One was to have a sequence of links as an item at the top of the YAML document, and make references to that, but this would have significantly increased the work typing each entry, even the ones without links... and most of the entries do not have links. I chose instead to make a small concession to the Wiki way of doing things and write links like this: [I am the text around the href which is http://foo.com/bar].

To do this I needed a parser. Let it be said that I have never had any luck writing my own parsers for anything, and even extremely simple parsers like this one have stymied me in the past. I don't know why, but for some reason I suck at parsing. It could be called my Achilles heel, the thing I dread doing more than anything else.

Perhaps that will no longer be true. At the suggestion of deltab on IRC, I looked into an undocumented feature of Python's sre module, sre.Scanner. This provides a simple scanning interface which lets you define tokens with regular expressions--a common tactic in parsing-helper modules--and then bind them to callbacks. You create a Scanner object by initializing it with a list of the tokens, in the order that they should be matched. There were only three interesting tokens for me: bracketed links, "[[" which I intended to allow as an escape for the [ character, and everything else. The code to emit <a> links is below:
import sre

from nevow import tags as T   # T is Nevow's stan tag module

def got_char(scanner, token): return token

def got_bracket(scanner, token): return '['

def got_link(scanner, token):
    # strip start and end brackets
    token = token[1:-1]

    words = token.split()
    url = words[-1]
    rest = ' '.join(words[:-1])

    return T.a(href=url)[rest]

tokens = [(r'\[\[', got_bracket),
          (r'\[[^\]]+\]', got_link),
          ('.', got_char),
          ]
scanner = sre.Scanner(tokens)
...
scanner.scan(para)

If you look at my code in news2html.py, you'll see that I later added scanning to fix double and single quotes, turning them into double and single “smart” quotes.
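The idea is just more tokens for the same scanner. A rough sketch of the double-quote half (not the actual news2html.py code, just the shape of it) keeps a bit of state and alternates between opening and closing marks:

dquote_open = [False]

def got_dquote(scanner, token):
    # toggle between an opening and a closing curly double quote
    dquote_open[0] = not dquote_open[0]
    if dquote_open[0]:
        return u'\u201c'
    return u'\u201d'

# this pattern would go into the tokens list ahead of the catch-all '.'
dquote_token = (r'"', got_dquote)

Single quotes need a little more care, since apostrophes inside words shouldn't toggle the state.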

As finishing touches, I added the command line option --count to control how many news items will appear in the generated page, and allowed an additional command line parameter to control which template is used to generate the HTML page.

Next Steps

RSS! I added a --output command line option, and for my next project I'll be using news2html to generate a feed for PBP. Not that many people will notice or care, but when it's this easy why not make it available for those who do?

Thursday, November 04, 2004

apt-get install antidote

New Distro #73

OK, so I've started to play with Ubuntu. Kick-ass distribution. Installed without a hitch on a server, a desktop machine, 4 Sony laptops and one Dell laptop. And by 'installed' I mean: got the screen resolution, sound, and network cards right, first try, no human intervention. Two different kinds of wireless network cards in that group, and the server's network card is a funky dual Intel one that no other distribution has picked up yet out of the box. Knoppix couldn't even do that.

There's one small problem, and it is small (smaller than I thought it was, even): not all Debian packages are available out of the box. They only support a subset (a large subset) of recent-vintage Debian packages.

I needed winbind, first thing, and that package is not in Ubuntu. So I did what I have been told time and again never to do: I added the Debian sid archive to /etc/apt/sources.list and fired up Synaptic to install the winbind package. This went off without a hitch, but on my next upgrade I ended up upgrading hundreds of packages to sid, since sid is now slightly more recent than Ubuntu for very uninteresting reasons.

Oops

This messed up Ubuntu's nice desktop layout. Realizing I didn't want to keep it that way, I removed the sid repositories from sources.list and then downgraded to the Ubuntu versions of a few of the desktop packages using an old apt trick:

$ sudo apt-get install gnome-system-tools=1.0.0-0ubuntu7

This was fine, got my desktop back to pretty again. I knew I still had loads of "foreign" packages installed on my system, but I resolved not to think about it too hard.

But pretty soon I needed another package. This one happened to be python-dev. The Python package structure in Debian (and therefore Ubuntu) is a rat's nest of weird, exact-version dependencies. If you try to mix repositories on Python, you will end up reinstalling hundreds of packages, even using the trick above. Wait, that's an exaggeration: in my case, the number was actually 76 packages that had to be reinstalled, because I already had Debian's sid version of Python installed and needed to install one from Ubuntu. In order to use the apt trick above to fix this problem, I would have had to look up the most recent Ubuntu version of all 76 of those packages (using apt-cache showpkg) and then type each one on the command line. If I'm smart, maybe I could write a little Python code to help me build the list, but it would still be a monstrous pain in the ass.
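Purely as a sketch of what that helper could look like: the script below shells out to apt-cache showpkg for each package named on the command line and picks the candidate whose version string contains 'ubuntu'. (It leans on assumptions about the showpkg output format -- that each candidate version starts a line in the "Versions:" section, and that Ubuntu's versions are recognizable by the 'ubuntu' substring -- so treat it as the shape of a tool, not a tested one.)

import os
import sys

def ubuntu_version(pkg):
    # walk the "Versions:" section of `apt-cache showpkg` output and
    # return the first version string that looks like an Ubuntu one
    in_versions = False
    for line in os.popen('apt-cache showpkg %s' % pkg):
        line = line.strip()
        if line.startswith('Versions:'):
            in_versions = True
            continue
        if in_versions:
            if not line:
                break
            version = line.split('(')[0]
            if 'ubuntu' in version:
                return version
    return None

args = []
for pkg in sys.argv[1:]:
    version = ubuntu_version(pkg)
    if version is not None:
        args.append('%s=%s' % (pkg, version))
print 'apt-get install ' + ' '.join(args)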

I have torn my hair out on this issue many times before, when I did things like install Debian sarge and then install everything from backports.org, or try to add one sid package to a woody system and end up installing hundreds of unwanted packages. My conclusion has always been: you can't go back. You're stuck with your mixed repositories, and you'll end up upgrading everything under the sun. Hope everything keeps working right! :-)

You Can Go Home Again

Turns out that's not true; the solution is actually quite simple. There's a file, called apt_preferences by its man page, that lives at /etc/apt/preferences. It controls where you get packages from, and under what conditions, when you have multiple sources in sources.list. It builds filters by combining groups of three facts:
  • What packages the filter applies to, with "Package:"
  • Where those packages come from, with "Pin:"
  • How high a priority you want to set for the filtered packages, with "Pin-Priority:"
Basically you say, "for these packages, if they're coming from this location, install them with this much priority." You can, for example, use apt_preferences to set a priority of 0 which says "Don't install the package from here, ever." I wanted just the opposite, the most forceful priority possible. A priority > 1000 says "Install packages from this location, no matter what, even if they're less recent than the one already installed." Used like this, it forces Ubuntu to downgrade everything on your system to the version in Ubuntu:

Package: *
Pin: release a=warty
Pin-Priority: 1001

Once you've made this setting, just do an apt-get upgrade. I don't promise it'll be smooth as glass; packages aren't tested for downgrading, let alone downgrading across distributions. At worst you will have to run it a few times, and possibly manually select a few packages to remove completely to prevent conflicts, but it does work. I just used it to downgrade 466 packages from sid to warty. :-)

So then they tell me about the universe repository, which is where I should have found winbind...

By the Way

Don't forget to change your apt_preferences back to the way it comes out of the box (missing). Just delete it when you're done. Otherwise, you won't get security updates, because warty/security doesn't match the Pin you specified in apt_preferences. Once you've gotten rid of the preferences file, do another apt-get upgrade and you'll get the most recent security updates again.

Update


In addition to universe, I recently learned how to enable the multiverse repository and I feel it deserves a mention because useful things like browser plugins live there. Add this to your sources.list:
deb http://archive.ubuntu.com/ubuntu/ warty multiverse