Sunday, November 07, 2004

Experiments in Writing an Offline Blog

I did a very strange thing yesterday. I wrote a blog. At least, it might look like a blog if you allowed certain liberties, such as the fact that most blogs don't force you to upload a static HTML file in order to publish a post. I will try to explain why I did this, but I'm going to tell you up front that the explanation will not be satisfying and largely amounts to "because I could". I will also explain how I did it, and that will be more interesting if you have a problem in the same category as mine.

The Problem

As the author of PBP, I have to keep the front page updated with news items. Until now I've been doing this by editing an HTML file by hand. Well, I hate editing XML by hand, and while HTML at least is supported by editors, they never get the markup quite right. The editor I'm using right now to generate this blog entry is doing ugly things like using font style attributes and making guesses about where my line breaks should go. For prose on my blog, that's acceptable, but I expect a certain precise format for my "professional" projects.

Frustrated with the errors I was constantly making getting the news file just right, I decided to make news entries something much simpler to type. I've been doing a lot of successful experimenting with YAML lately, and although I had previously decided it was more appropriate for "data" formats than "document" formats (whatever you think that means) I figured news entries were brief enough and structured enough that YAML might be appropriate.

I also needed a way to render the YAML to HTML. Well, I already had this, in Nevow. Of course, you can't run your own webserver on BerliOS, and Nevow isn't really appropriate for CGI yet, and in any case I have no need yet for this level of control. I just needed a way to generate the static HTML file that I was at the time generating by hand.

When I was done I wanted a tool I could run on a YAML file, which would plug my news items into a template and give me an HTML document suitable for posting on the web. Then in the future, when I want to edit my news items, I only have to edit a YAML file. As YAML is much more readable and writeable than XML and need contain only the dynamic part of the page (the news items) instead of all of it, this would be a much less onerous chore. As a bonus, if I made a formatting mistake, I would get a parse error instead of having the problem hidden behind a web browser's sorta-error-correcting rendering engine.

The Implementation: An "Offline Blog"

The implementation for this is available in PBP's svn repository, which you can check out with the command
svn co svn://svn.berlios.de/pbp/trunk pbp
I will refer to parts of this by their location relative to the root of the working copy, so (for example) /doc/pbp.xhtml refers to svn://svn.berlios.de/pbp/trunk/doc/pbp in the repository and pbp/doc/pbp.xhtml in the working copy.

Step one was getting Nevow to render a static HTML file. Fortunately this is easy. There are two ways to do it. One is the newly-added Page.renderSynchronously method, which returns a rendered HTML file. The other is to start and stop a reactor and use plain old callbacks to finish rendering. The former method is fine if you are sure you'll never need to execute anything asynchronously while publishing the file (therefore needing real deferred handling), but I figured I would go ahead and use the reactor in case I ever made a GUI out of this.

First I designed a template: /doc/pbp.xhtml. I won't go into the details of Nevow templates, and this one is completely unremarkable. Only two interesting things will be plugged in: the latest released version, which appears in the template as <n:slot name='latest' />, and the news items, which appear as <n:invisible render='news' />.

Then I needed some data to put into the template. I had to learn up a bit on YAML, but I came up with /doc/news.yml with very little trouble. (Tip: To use >-style folding, you need to indent everything in the block, even paragraph separators. A paragraph separator is actually a line containing only n blank spaces between two paragraphs, where n is the number of spaces the preceding paragraph is indented.) The YAML document contains two items, "release:" which contains the version of the most recent release, and "news:" which is a sequence of all the news items, ever. Each news item in turn is a map of three items, "date:", "title:", and "c:" which stands for content. The important thing to see is how simple it is to add new items to this file, and how extremely readable the file is. And since it's done with YAML, I didn't have to write my own parser for the format, as PyYaml already exists.

To parse this file into native Python data types is just one call:
ydata = list(yaml.loadFile(yamlfile))[0]
yaml.loadFile returns a stream of YAML documents, and is designed for streaming data. I turn this generator into a list with list() and take the first (zeroth) item from it, as there is only one YAML document in the file.
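To make the shape of ydata concrete, here is a minimal sketch that needs no YAML library at all. The generator fake_load_file is a hypothetical stand-in for yaml.loadFile, and the values in it are made-up placeholders; only the keys (release, news, date, title, c) come from the description above.

```python
def fake_load_file(path):
    """Hypothetical stand-in for yaml.loadFile: yields one parsed document at a time."""
    # Placeholder values, invented for illustration; only the keys follow news.yml.
    yield {
        'release': '0.2',
        'news': [
            {'date': '2004-11-06',
             'title': 'Example item',
             'c': 'Example content for the news item.'},
        ],
    }


# The idiom from the text: exhaust the stream, keep the first document.
ydata = list(fake_load_file('news.yml'))[0]

# An equivalent that stops after the first document instead of reading them all.
first = next(iter(fake_load_file('news.yml')))

print(ydata['release'])
print([item['title'] for item in ydata['news']])
```

With only one document in the file, the two forms give the same result; the list() version is just the more obvious one to read.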

To go with the template I needed a rend.Page subclass, which is news2html.Page in /doc/news2html.py. I pass ydata to the Page and the news items and release version are extracted in Page.render_news and Page.render_download.
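If you don't have Nevow handy, the data-into-template step can be sketched with nothing but the standard library. This is not how news2html.py does it: string.Template is swapped in for the Nevow template and rend.Page, and the markup, the render_page function, and the sample values are all assumptions made up for the illustration.

```python
from html import escape
from string import Template

# Hypothetical stand-in for the real template; $latest and $news play the
# roles of the 'latest' slot and the 'news' renderer.
PAGE = Template("""<html><body>
<p>Latest release: $latest</p>
$news
</body></html>""")

ITEM = Template("<h3>$date: $title</h3>\n<p>$content</p>")


def render_page(ydata):
    """Return the whole page as a string, ready to be written to a static file."""
    news_html = "\n".join(
        ITEM.substitute(
            date=escape(str(item['date'])),
            title=escape(item['title']),
            content=escape(item['c']),
        )
        for item in ydata['news']
    )
    return PAGE.substitute(latest=escape(str(ydata['release'])), news=news_html)


# Placeholder data in the same shape as the parsed news file.
sample = {
    'release': '0.2',
    'news': [{'date': '2004-11-06', 'title': 'Fish & chips', 'c': 'Hello.'}],
}
page = render_page(sample)
print(page)
```

Note that every value is escaped before it goes into the page, which is the part a real templating system like Nevow takes care of for you.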

Now there was just one problem left to solve. I wanted to be able to put links in my HTML documents, of the form <a href='...'>blah blah</a>.

A Secret Friend

I considered several possibilities for the problem of HTML linking. One was to have a sequence of links as an item at the top of the YAML document, and make references to that, but this would have significantly increased the work typing each entry, even the ones without links... and most of the entries do not have links. I chose instead to make a small concession to the Wiki way of doing things and write links like this: [I am the text around the href which is http://foo.com/bar].

To do this I needed a parser. Let it be said that I have never had any luck writing my own parsers for anything, and even extremely simple parsing jobs like this have stymied me in the past. I don't know why, but for some reason I suck at parsing. It can be said to be my Achilles heel, the thing I dread doing more than anything else.

Perhaps that will no longer be true. At the suggestion of deltab on IRC, I looked into an undocumented feature of Python's sre module, sre.Scanner. This provides a simple scanning interface which lets you define tokens with regular expressions--a common tactic in parsing-helper modules--and then bind them to callbacks. You create a Scanner object, initializing it with a list of the tokens in the order that they should be matched. There were only three interesting tokens for me: bracketed links, "[[" which I intended to allow as an escape for the [ character, and everything else. The code to emit <a> links is below:
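import sre  # in later Python versions this module is simply called re
from nevow import tags as T

def got_char(scanner, token):
    # any other character passes through unchanged
    return token

def got_bracket(scanner, token):
    # "[[" is the escape for a literal "["
    return '['

def got_link(scanner, token):
    # strip start and end brackets
    token = token[1:-1]

    # the last word is the URL; everything before it is the link text
    words = token.split()
    url = words[-1]
    rest = ' '.join(words[:-1])

    return T.a(href=url)[rest]

# order matters: earlier patterns are tried first
tokens = [(r'\[\[', got_bracket),
          (r'\[[^\]]+\]', got_link),
          ('.', got_char),
          ]
scanner = sre.Scanner(tokens)
...
# scan() returns a list of callback results plus any unmatched remainder
scanner.scan(para)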
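If you want to try the scanner without Nevow, here is a minimal self-contained sketch of the same idea. It makes a few assumptions that differ from the code above: it uses re.Scanner (the same undocumented class under the module's later name), it emits plain HTML strings rather than Nevow tags, it passes re.DOTALL so that newlines are scanned too, and render_links is a hypothetical helper name.

```python
import re
from html import escape


def got_char(scanner, token):
    # Ordinary text: escape it so it is safe inside HTML.
    return escape(token, quote=False)


def got_bracket(scanner, token):
    # "[[" is the escape for a literal "[".
    return '['


def got_link(scanner, token):
    # Drop the surrounding brackets; the last word is the URL.
    words = token[1:-1].split()
    if not words:
        # Only whitespace between the brackets: leave the text as it was.
        return escape(token, quote=False)
    url = words[-1]
    text = ' '.join(words[:-1])
    return '<a href="%s">%s</a>' % (escape(url, quote=True),
                                    escape(text, quote=False))


# Patterns are tried in order, so the "[[" escape must come before links.
scanner = re.Scanner([
    (r'\[\[', got_bracket),
    (r'\[[^\]]+\]', got_link),
    (r'.', got_char),
], flags=re.DOTALL)


def render_links(para):
    # scan() returns (list of callback results, unmatched remainder).
    results, remainder = scanner.scan(para)
    return ''.join(results) + escape(remainder, quote=False)


print(render_links('See [the docs http://foo.com/bar] now'))
```

Because the catch-all pattern matches any single character, the remainder should always come back empty here; it is appended anyway so that nothing is silently dropped if the token list changes.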

If you look at my code in news2html.py, you'll see that I later added scanning to fix quotes as well, turning straight double and single quotes into “smart” quotes.
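For the curious, here is one plausible way to do the quote fixing. This is not the code in news2html.py: it swaps the scanner for two plain regular-expression substitutions, the function name smarten_quotes is made up, and the rule it applies (a quote at the start of the text or after whitespace or an opening bracket is an opening quote, anything else is a closing quote) is a rough heuristic that will get some nested quotes wrong.

```python
import re


def smarten_quotes(text):
    """Hypothetical helper: replace straight quotes with curly ones."""
    # A quote NOT preceded by a non-space, non-bracket character is treated
    # as an opening quote. That covers the start of the text as well.
    text = re.sub(r'(?<![^\s(\[])"', '\u201c', text)
    text = re.sub(r"(?<![^\s(\[])'", '\u2018', text)
    # Every straight quote still left is a closing quote or an apostrophe.
    text = text.replace('"', '\u201d')
    text = text.replace("'", '\u2019')
    return text


print(smarten_quotes('He said "hi" and it\'s fine'))
```

The apostrophe in "it's" comes out as a closing single quote, which is the conventional choice.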

As finishing touches, I added the command line option --count to control how many news items will appear in the generated page, and allowed an additional command line parameter to control which template is used to generate the HTML page.
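A command line like that could be sketched as follows. Only the --count option and the optional template parameter come from the description above; the use of argparse, the argument names, the default template name, and the select_items helper are assumptions for the sake of the example, and the slicing assumes the news items are listed newest first.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical command line: news2html [--count N] yamlfile [template]."""
    parser = argparse.ArgumentParser(
        description='Render news items from a YAML file to a static HTML page.')
    parser.add_argument('--count', type=int, default=None,
                        help='how many news items to show (default: all)')
    parser.add_argument('yamlfile', help='YAML file holding the news items')
    # The default template name here is an assumption.
    parser.add_argument('template', nargs='?', default='pbp.xhtml',
                        help='template to plug the news items into')
    return parser.parse_args(argv)


def select_items(news, count):
    # Assumes the list is ordered newest first, so the first items are kept.
    return news if count is None else news[:count]


args = parse_args(['--count', '3', 'news.yml'])
print(args.count, args.yamlfile, args.template)
print(select_items(['d', 'c', 'b', 'a'], args.count))
```

Passing an explicit list to parse_args, as above, makes it easy to try the options without touching the real command line.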

Next Steps

RSS! I added a --output command line option, and for my next project I'll be using news2html to generate a feed for PBP. Not that many people will notice or care, but when it's this easy why not make it available for those who do?
