python saxparser woes

September 5, 2007 at 6:34 am | Posted in python

To make a long story short: when you use Python’s xml.sax parser for namespace-aware XML processing, you may be in for trouble.

If you run the following little script

import xml.sax, StringIO

class Parser(xml.sax.handler.ContentHandler):
    # Print whatever the parser reports for each element.
    def startElementNS(self, name, qname, attrs):
        print name, qname, attrs
    def endElementNS(self, name, qname):
        print name, qname

# Ask for the libxml2 driver and switch on namespace processing.
p = xml.sax.make_parser(["drv_libxml2"])
p.setFeature(xml.sax.handler.feature_namespaces, 1)
p.setContentHandler(Parser())
# Feed the parser a single prefixed, namespaced element.
s = xml.sax.xmlreader.InputSource()
s.setByteStream(StringIO.StringIO("""<?xml version='1.0' encoding='utf-8'?>
<prefix:element xmlns:prefix="http://www.python.org/sax_error"/>
"""))
p.parse(s)

you should see something like


@:/tmp$ python test.py
(u'http://www.python.org/sax_error', u'element') prefix:element
(u'http://www.python.org/sax_error', u'element') prefix:element

What you may see though, is


~ > python test.py
(u'http://www.python.org/sax_error', u'element') prefix:element
(u'http://www.python.org/sax_error', u'element') None

In the latter case, endElementNS receives None instead of a proper qualified name, and code relying on that name (e.g. feedparser) will not work as expected (e.g. it will not find elements in an RSS 2.0 feed).
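One way to sidestep the problem in your own handlers is to not depend on the qualified name at all and to identify elements by the (namespace URI, local name) pair, which both outputs above report consistently. The following is only a minimal sketch under that assumption; NameOnlyHandler, parse_events and DOC are names made up for this illustration, and the code uses whatever default driver xml.sax picks.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# The same sample document as in the script above, as bytes.
DOC = (b"<?xml version='1.0' encoding='utf-8'?>\n"
       b"<prefix:element xmlns:prefix=\"http://www.python.org/sax_error\"/>\n")


class NameOnlyHandler(ContentHandler):
    """Hypothetical handler that records (uri, localname) and ignores qname."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElementNS(self, name, qname, attrs):
        # name is the (namespace URI, local name) pair; qname may be None.
        self.events.append(("start", name))

    def endElementNS(self, name, qname):
        self.events.append(("end", name))


def parse_events(data):
    """Parse bytes with namespace processing on and return recorded events."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = NameOnlyHandler()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.events
```

This does not help with third-party code such as feedparser, which you cannot easily rewrite, but it keeps your own handlers independent of what the driver passes as qname.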

It turns out that setting


p = xml.sax.make_parser(["drv_libxml2"])
p.setFeature(xml.sax.handler.feature_namespaces, 1)

(i.e. specifying drv_libxml2 as the preferred driver) does not ensure that you get a namespace-aware parser that behaves as in the first output. Instead, if the Python bindings for libxml2 are not installed, xml.sax silently falls back to the default driver, which, as shown above, does not do what you want.
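If you would rather get an error than a silent fallback, you can load the driver module yourself instead of going through make_parser. The sketch below assumes that a SAX driver is a module exposing a create_parser() function, which is how make_parser itself uses drivers; make_parser_strict and driver_of are hypothetical helper names, not part of xml.sax.

```python
import importlib
import xml.sax
from xml.sax.handler import feature_namespaces


def make_parser_strict(driver_name):
    """Build a parser from exactly one driver module, or fail loudly."""
    try:
        driver = importlib.import_module(driver_name)
    except ImportError as exc:
        # No fallback: tell the caller that the requested driver is missing.
        raise RuntimeError("SAX driver %r is not available: %s"
                           % (driver_name, exc))
    parser = driver.create_parser()
    parser.setFeature(feature_namespaces, True)
    return parser


def driver_of(parser):
    """Return the name of the module that defines the parser's class."""
    return parser.__class__.__module__
```

With these helpers, make_parser_strict("drv_libxml2") raises a RuntimeError when the libxml2 bindings are missing, and driver_of(xml.sax.make_parser(["drv_libxml2"])) tells you which driver you really ended up with after any fallback.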

This behaviour is, in my opinion, totally inappropriate, and it makes problems really hard to debug. My colleague and I actually started doubting the universality of the universal feedparser. It’s also not in line with “explicit is better than implicit”.

But then again, maybe it’s just one of the things you need to know.
