What Designers of a Protocol can learn from OAI-PMH

July 12, 2009 at 5:35 pm | Posted in cgi, programming | 1 Comment

I spent the last couple of days writing an OAI-PMH data provider (two, actually). And to give a positive twist to the things that bothered me about OAI-PMH, I’ll list them as “things not to do in a protocol”:

  • additional arguments are not ignored but must be treated as errors.

    This leads to a lot of additional code just to make sure a data provider is compliant. But for a different OAI-PMH data provider I wrote before it was even worse: I had one data provider serving different repositories, but I couldn’t simply add an additional URL parameter to the base URL to distinguish these.

  • HTTP is just used as transport layer.

    An error response in OAI-PMH is not an HTTP response with a status other than 200, but an XML error message delivered with HTTP 200. This means more programming work on both the server and the client side.

  • noRecordsMatch error message instead of delivering an empty list.

    Again, this means more work on the data provider side, and I assume for most harvesters as well. Additionally, since this error can only be detected after running some logic on the server, it is hard to implement a single check_args routine that is called before any other action and then handle only the success case.

  • The resumption token.

    The flow control mechanism of OAI-PMH is way too specific. It’s completely geared towards stateful data providers. Imagine your resources live in an SQL database; you’d retrieve them in batches using simple LIMIT and OFFSET clauses. But you can’t just put the offset in the resumption token, because it is an exclusive argument, i.e. with the next request from the client, you will receive only the resumption token, but none of the other arguments supplied before. So what the typical CGI programmer ends up with is embedding all other arguments in the resumption token, and upon the next request, parsing the resumption token to reconstruct the complete request.
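The workaround described in the last item can be sketched roughly as follows. This is a minimal illustration, not the code of either data provider: the function names and the JSON/base64 encoding are assumptions chosen for the example, and any reversible encoding would do.

```python
import base64
import json


def encode_resumption_token(args, offset):
    """Pack the original request arguments and the next offset into one token."""
    payload = json.dumps({"args": args, "offset": offset}, sort_keys=True)
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_resumption_token(token):
    """Recover the original arguments and the offset from a token."""
    payload = json.loads(base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8"))
    return payload["args"], payload["offset"]


# First request: the client supplies all arguments (example values).
first_args = {"metadataPrefix": "oai_dc", "from": "2009-01-01", "set": "physics"}
token = encode_resumption_token(first_args, offset=100)

# Follow-up request: only the token arrives, so the request is rebuilt from it.
restored_args, restored_offset = decode_resumption_token(token)
```

With the arguments and the offset restored, a stateless provider can run the same LIMIT/OFFSET query as before without keeping any per-client state on the server.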

So if you are about to design a protocol on top of HTTP, learn from the mistakes of OAI-PMH.


LinkedData with TurboGears and squid

December 3, 2008 at 6:33 pm | Posted in http, programming, python | 1 Comment

At work we run several TurboGears web applications, deployed behind Apache acting as proxy. I like this setup a lot, because it allows us to reuse our log file analysis infrastructure for the web applications as well.

Some of the web applications serve data that rarely changes, so to lower the traffic for TurboGears, I decided to use a transparent cache proxy. Since logging is already taken care of by Apache, I don’t mind that not all requests reach the application server.

We settled on putting a squid cache between Apache and TurboGears, which worked well after some fiddling.

Recently a new requirement came up: Serving Linked Data with these web applications. This is actually pretty easy to do with TurboGears. Creating RDF+XML with the templating engines works well, and even the content negotiation mechanism recommended for serving the data is well supported. To have TurboGears serve the RDF+XML representation for a resource just decorate the corresponding controller as follows:


TurboGears will pick the appropriate template based on the Accept header sent by the client.

Unfortunately, this setup, with different pages served for the same URL, doesn’t work well with our squid cache. But the Vary HTTP header comes to our rescue. Sending a Vary header with the response tells squid which HTTP request headers have to be taken into account when caching that response; squid will then use a key combined from the URL and these significant headers for the cached page.

Now the only header important for our content negotiation requirement is the Accept header, so putting the following line in the controller does the trick:

response.header['Vary'] = 'accept'
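To illustrate the idea, here is a small sketch of how a cache could combine the URL with the request headers named in Vary. It is not squid’s actual implementation; the function name and the key layout are assumptions made for the example.

```python
def cache_key(url, request_headers, vary):
    """Build a cache key from the URL plus the request headers listed in Vary."""
    # Header names are case-insensitive, so normalise them before comparing.
    names = sorted(name.strip().lower() for name in vary.split(",") if name.strip())
    lowered = {name.lower(): value for name, value in request_headers.items()}
    return (url,) + tuple((name, lowered.get(name, "")) for name in names)


html_key = cache_key("http://localhost/feature/28", {"Accept": "text/html"}, "accept")
rdf_key = cache_key("http://localhost/feature/28", {"Accept": "application/rdf+xml"}, "accept")
# A header not listed in Vary does not influence the key.
html_key_other_agent = cache_key(
    "http://localhost/feature/28",
    {"Accept": "text/html", "User-Agent": "example"},
    "accept",
)
```

The two representations of the same URL end up under different keys, while requests that differ only in headers not listed in Vary still share one cache entry.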


September 11, 2008 at 8:43 pm | Posted in programming | 1 Comment

Last week i found out why it was a piece of cake for JSON to become the X in Ajax. XmlHttpRequest with XML is just a pain. Here’s why.

I wanted to pull data from a feed into a page. The feed is served as application/rss+xml. After a successful XmlHttpRequest, i wanted to retrieve the items from the request object’s responseXML property.

Unfortunately, this didn’t work on IE 6/7. Turns out IE only parses the response when it is served as text/xml. Ok, so we just grab responseText and parse it ourselves. Problem solved.

Hm. It doesn’t work in Konqueror either. Fortunately the trouble with IE pointed me in the right direction. While Konqueror does parse responses with mime-type application/xml, my feed, served as application/rss+xml, would still end up unparsed.

Of course there’s no cross-browser way of parsing XML. So what i ended up with was:

var items;
try {
  items = req.responseXML.documentElement.getElementsByTagName('item');
} catch(e) {
  try {
    // for IE we have to do the parsing ourselves, because the feed isn't delivered as text/xml ...
    var doc = new ActiveXObject("Microsoft.XMLDOM");
    // loadXML parses a string; without it the document stays empty
    doc.loadXML(req.responseText);
    items = doc.documentElement.getElementsByTagName('item');
  } catch(e) {
    try {
      // ... same for Konqueror
      var p = new DOMParser();
      var doc = p.parseFromString(req.responseText, "text/xml");
      items = doc.documentElement.getElementsByTagName('item');
    } catch(e) {
      // well, at least we'll get the title later
      items = [];
    }
  }
}

And don’t get me started on handling namespaced XML …

Migrating data from b2evolution to wordpress μ

August 15, 2008 at 4:29 pm | Posted in programming, publishing, wordpress | 8 Comments

At work, we are in the process of migrating blogs from a b2evolution installation to wordpress μ. Shopping around for tools to do this didn’t reveal anything usable. Either the versions wouldn’t match or the functionality was too limited (generic import from RSS for example wouldn’t help with files, users and comments).

So i came up with my own script. I didn’t advertise it at wordpress.org, because obviously, just as most other migration tools, it’s tailored to our situation. In particular, it’s only tested with b2evolution 2.3 and wordpress 2.6; but since it relies mostly on the database schema, it should also be compatible with wordpress >= 2.3.


  • migrates users, categories, posts, comments and files.
  • migrates one blog at a time.
  • works for wordpress and wordpress μ.
  • undo functionality, thus testing imports is easy.


The script is written in python and is meant to be run on the machine where the wordpress installation is located (to be able to copy files into wp-content). It also assumes that the wordpress database is located on this machine.

The b2evolution installation does not have to be on the same machine, though, because the b2evolution data is scraped from an SQL dump of the database.

So in case somebody finds this useful – but not quite perfect – let me know, and i’ll see what i can do to help.

Update: More info about how to use the script and how to do the migration in general is available at https://dev.livingreviews.org/projects/wpmu/wiki/b2evolutionMigration.

so we got wordpress μ …

July 31, 2008 at 6:00 pm | Posted in programming, publishing, wordpress | Leave a comment

… and what i wanted to do most, was creating posts automatically, using xmlrpc. alas! i couldn’t get it to work for quite some time.

googling turned up some discussion threads, which told me that xmlrpc support isn’t enabled by default anymore, possibly for good reason. this is already useful information, because the error you may get from your xmlrpc client is about failed authentication, which is a strange way to signal: not enabled.

anyway, one comment got me almost there. “Site Admin->Blogs->Edit”. but what then? no mention of xmlrpc, API, blogger, you name it. the setting to change turned out to be “Advanced edit”. set it from 0 to 1, and xmlrpc should work.
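For reference, here is a rough sketch of what creating a post over XML-RPC can look like from Python. It assumes the metaWeblog.newPost call that WordPress exposes, and only builds the request body instead of contacting a server; the blog id, user name, password and post content are made-up example values.

```python
import xmlrpc.client


def build_new_post_call(blog_id, username, password, title, body, publish=True):
    """Serialise a metaWeblog.newPost request without sending it."""
    content = {"title": title, "description": body}
    return xmlrpc.client.dumps(
        (blog_id, username, password, content, publish),
        methodname="metaWeblog.newPost",
    )


# Example values only; none of them refer to a real installation.
payload = build_new_post_call(1, "admin", "secret", "hello", "posted via xmlrpc")

# Parse the request again to check what a server would receive.
params, method = xmlrpc.client.loads(payload)
```

To actually send it, xmlrpc.client.ServerProxy pointed at the blog's xmlrpc.php endpoint can call metaWeblog.newPost with the same five arguments.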

Squid as accelerator for TurboGears

April 16, 2008 at 5:49 am | Posted in http, programming | 3 Comments

I’m trying to use squid as transparent caching proxy between apache httpd and a – largely read-only – TurboGears web application.

Apache already acts as proxy, sending requests to the web app which listens on a different port. But using mod_cache was not an option, because for apache 2.0 it’s still experimental and additionally it doesn’t seem to work well with mod_rewrite.

So the idea was to simply plug in squid.

The main problem so far was to narrow down the monumental squid configuration to the few lines i actually need. This is what i came up with so far:

http_port accel defaultsite=site.served.by.web.app
cache_peer parent 8080 0 no-query originserver
refresh_pattern ^http: 1440 20% 1440 override-expire override-lastmod ignore-no-cache ignore-reload ignore-private
acl all src
acl our_sites dstdomain
http_access allow all
http_access allow our_sites

The http_port directive tells squid to listen on port 8888, in accelerator mode, proxying the default site.

The cache_peer directive specifies – i.e. the web app – as only cache peer. So whenever squid cannot serve a request from its cache, this is where it will turn to for help. The last three tokens 0 no-query originserver basically say that this is not another squid proxy, by setting the ICP port to 0.

The refresh_pattern directive specifies the rules according to which an item in the cache is to be treated as fresh or stale. In this case, all items with an http URL will be regarded as fresh for one day (1440 minutes). The options override-expire override-lastmod ignore-no-cache ignore-reload ignore-private basically override whatever either client or web app say about caching – so this setup is NOT an http compliant cache. But that’s alright, since we only cache stuff that we are the producers of, so we should know.

I didn’t spend much time investigating the access control settings, since i figure my setup, with squid only listening on an internal port, already does away with most security concerns.

So this is what the results look like in squid’s access log:

1208325630.099 688 TCP_MISS/200 11103 GET http://localhost/feature/28 - FIRST_UP_PARENT/ text/html
1208325634.274 1 TCP_HIT/200 11109 GET http://localhost/feature/28 - NONE/- text/html

The second token is the number of milliseconds squid needed to process the request: 688 ms for the cache miss, 1 ms for the subsequent hit.
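If you want to compare hits and misses over a longer log, a few lines of Python are enough. This is only a sketch that handles lines shaped like the two above; the names LogEntry and parse_access_line are invented for the example, and the result field is located by its pattern rather than by a fixed position.

```python
import re
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: float
    elapsed_ms: int
    result: str   # e.g. TCP_HIT or TCP_MISS
    status: int   # HTTP status code


# The result field looks like TCP_MISS/200.
RESULT_FIELD = re.compile(r"^([A-Z_]+)/(\d{3})$")


def parse_access_line(line):
    """Extract timestamp, elapsed time and cache result from one log line."""
    fields = line.split()
    for field in fields[2:]:
        match = RESULT_FIELD.match(field)
        if match:
            return LogEntry(float(fields[0]), int(fields[1]),
                            match.group(1), int(match.group(2)))
    raise ValueError("no result field found in: " + line)


SAMPLE = """\
1208325630.099 688 TCP_MISS/200 11103 GET http://localhost/feature/28 - FIRST_UP_PARENT/ text/html
1208325634.274 1 TCP_HIT/200 11109 GET http://localhost/feature/28 - NONE/- text/html
"""

entries = [parse_access_line(line) for line in SAMPLE.splitlines()]
```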

Pretty printing xml

November 16, 2007 at 4:37 pm | Posted in programming, xml | Leave a comment

Simple problems should have simple solutions. As a mathematician I do know, that’s not always the case; but I still feel it should be. So how simple is pretty printing xml?

The simplest (in terms of usability on a linux machine) I’ve come up with so far:

tidy -i -xml -asxml input.xml
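When tidy isn’t installed, Python’s standard library offers a rough equivalent. This is a sketch rather than a drop-in replacement: the helper name pretty_xml is made up, and minidom may add extra blank lines when the input already contains whitespace between elements.

```python
from xml.dom import minidom


def pretty_xml(source, indent="  "):
    """Re-serialise an XML string with one element per line and indentation."""
    return minidom.parseString(source).toprettyxml(indent=indent)


pretty = pretty_xml("<a><b>text</b></a>")
print(pretty)
```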

postgresql trouble

October 30, 2007 at 9:30 am | Posted in postgresql, programming | Leave a comment

Just another post, to give a particular snippet I found on the web more visibility.

Today I ran into problems restoring a postgresql database from an sql dump. Since the same procedure did work before, I was a little worried; but after some googling I found the explanation (in Craig Hockenberry’s comment):

Note that when using pg_dump and a plain file format, that the escape string syntax described in will be used. If this file is then used to import on a previous version of postgres, the following error will be generated: ERROR:  type “e” does not exist.

Since there doesn’t appear to be any way to turn off this feature in pg_dump, the only solution is to upgrade the target database to version 8.1 or later.

Sure enough, I was trying to import a database dumped with 8.1.10 in a postgresql 7.4.7 cluster.

Production Ready

October 19, 2007 at 5:19 pm | Posted in programming | Leave a comment

Today, I discovered a definition for “production ready”:

An application is production ready, if it takes longer to find bugs than to fix bugs.

Of course, that’s not too much of an insight, because it’s just another way of stressing the importance of testing.

Sure enough, this thought occurred to me while working with software which is not production ready. A big part of why this particular software is not production ready is that its release/deployment process, which must be counted into the time to fix a bug, is too long to allow the developers to get ahead of me, the tester.

And that’s also why perpetual beta is possible with web applications. With no release/deployment effort, developers can beat the users in the race for bugs – in particular if the developers are users themselves.
