Vary: Accept and caching

January 26, 2010 at 6:58 pm | Posted in apache, http, squid | Leave a comment

OK, so the kind of content negotiation necessary to serve Linked Data does not prevent using squid as a transparent caching proxy: Vary: Accept to the rescue!

Unfortunately it turns out that there are quite a few different Accept headers in the wild, although probably an order of magnitude fewer than User-Agent headers. On a medium-traffic site this variability reduces the effect of caching almost to a server-side replica of a single machine’s browser cache. (In fact, considering that a Vary: Accept header prevents IE6 from caching anything at all, in some cases it recreates the browser cache with added latency.)

So in a situation where the only content negotiation our application actually does is whether to serve application/rdf+xml or do the default thing, what can we do? Obviously, reducing the number of different Accept headers that are sent to squid could help. It turns out that this is pretty easy to accomplish (in case an Apache httpd is running in front of squid):

SetEnvIf Accept ".*application\/rdf\+xml.*" RDF_XML_request
RequestHeader unset Accept env=!RDF_XML_request

Whenever the Accept header contains application/rdf+xml, we leave it intact (hoping that the number of different Accept headers fulfilling this condition will be rather small), otherwise we simply remove it, forcing the upstream application to resort to the default behaviour.
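The effect of the two directives can be mirrored in a few lines of Python (the function and the header dict are hypothetical, purely to illustrate the Apache logic):

```python
import re

# Keep the Accept header only if it mentions application/rdf+xml;
# otherwise drop it, so squid sees a single canonical variant.
def normalize_accept(headers):
    accept = headers.get("Accept", "")
    if not re.search(r"application/rdf\+xml", accept):
        headers.pop("Accept", None)
    return headers
```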

Of course as soon as the upstream application does more in terms of content negotiation than simply checking whether application/rdf+xml is the best match for a request, this scheme breaks down. But are there any other problems?

Update: The above configuration works for Apache httpd 2.2.x, for 2.0.x I was successful with

SetEnvIf Accept ".*application\/rdf\+xml.*" RDF_XML_request
RequestHeader set Accept "*/*" env=!RDF_XML_request

What Designers of a Protocol can learn from OAI-PMH

July 12, 2009 at 5:35 pm | Posted in cgi, programming | 1 Comment

I spent the last couple of days writing an OAI-PMH data provider (two, actually). And to give a positive twist to the things that bothered me about OAI-PMH, I’ll list them as “things not to do in a protocol”:

  • additional arguments are not ignored, but have to be handled as errors.

    This leads to a lot of additional code just to make sure a data provider is compliant. For a different OAI-PMH data provider I wrote earlier it was even worse: I had one data provider serving several repositories, but I couldn’t simply add an extra URL parameter to the base URL to distinguish them.

  • HTTP is used as a mere transport layer.

    An error response in OAI-PMH is not an HTTP response with a status != 200, but rather an XML error message delivered with HTTP 200. This means more programming work on both the server and the client side.

  • noRecordsMatch error message instead of delivering an empty list.

    Again this means more work on the data provider side – but also for most harvesters, I assume. Additionally, since this error can only be detected after running some logic on the server, it is hard to implement just one check_args routine that is called before any other action, and then handle only the success case.

  • The resumption token.

    The flow control mechanism of OAI-PMH is way too specific. It’s completely geared towards stateful data providers. Imagine your resources live in an SQL database; what you’d do to retrieve them in batches is use simple LIMIT and OFFSET settings. But you can’t just put the offset in the resumption token, because it is an exclusive argument, i.e. with the next request from the client, you will receive only the resumption token, but none of the other arguments supplied before. So what the typical CGI programmer ends up with is embedding all other arguments in the resumption token, and upon the next request, parsing the resumption token to reconstruct the complete request.
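That last workaround can be sketched in a few lines of Python (the function names are made up; the token format here is just URL-encoded key/value pairs):

```python
from urllib.parse import urlencode, parse_qs

# OAI-PMH forbids repeating the original arguments alongside a
# resumptionToken, so a stateless data provider has to smuggle them
# into the token itself, together with the current batch offset.
def make_resumption_token(args, offset):
    return urlencode(dict(args, offset=offset))

def parse_resumption_token(token):
    args = {k: v[0] for k, v in parse_qs(token).items()}
    offset = int(args.pop("offset"))
    return args, offset

token = make_resumption_token({"metadataPrefix": "oai_dc"}, 200)
# the provider later parses the token to reconstruct the full request
args, offset = parse_resumption_token(token)
```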

So if you are about to design a protocol on top of HTTP, learn from the mistakes of OAI-PMH.

Maximum object size for squid

July 3, 2009 at 5:35 pm | Posted in http, squid | Leave a comment

Note to self: the default maximum_object_size setting for squid is a – to me at least – unintuitively small 4 MB.
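If larger objects should be cacheable, the limit can be raised in squid.conf (64 MB below is just an example value):

```
# default would be 4 MB – allow bigger objects into the cache
maximum_object_size 64 MB
```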

Stop Password Masking?

June 28, 2009 at 8:02 pm | Posted in Uncategorized | Leave a comment

I think Jakob Nielsen has this one wrong. Masking password fields may make things harder for the user, but having watched people log in to applications in live demos – i.e. with a whole bunch of bystanders – tells me that there is a place for this technique.

A note on INNER JOIN and sqlite

April 29, 2009 at 12:20 pm | Posted in Uncategorized | Leave a comment


14:11:15$ time sqlite3 devdata.sqlite "select distinct l.name from source_word_donor_language as swdl inner join source_word as sw inner join word_source_word as wsw inner join word as w inner join language as l on swdl.source_word_id = sw.id and sw.id = wsw.source_word_id and wsw.word_id = w.id and w.language_name = l.name where swdl.donor_language_id = 873046508145964;"
Imbabura Quechua (Quichua)
Tzotzil of Zinacantan
Wichí [Matacoan]
Ceq Wong
Seychelles Creole

real 4m34.178s
user 3m59.379s
sys 0m32.766s


14:10:43$ time sqlite3 devdata.sqlite "select distinct l.name from source_word_donor_language as swdl, source_word as sw, word_source_word as wsw, word as w, language as l where swdl.source_word_id = sw.id and sw.id = wsw.source_word_id and wsw.word_id = w.id and w.language_name = l.name and swdl.donor_language_id = 873046508145964;"
Imbabura Quechua (Quichua)
Tzotzil of Zinacantan
Wichí [Matacoan]
Ceq Wong
Seychelles Creole

real 0m12.060s
user 0m12.017s
sys 0m0.048s

Wikipedia says the above SQL statements are equivalent. But that doesn’t mean an SQL engine will treat them equivalently.
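Equivalent in results, that is. A toy session with Python’s stdlib sqlite3 module (the two-table schema is invented for illustration) shows both spellings returning identical rows, while the engine remains free to plan them differently:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE word (id INTEGER, language_id INTEGER);
    CREATE TABLE language (id INTEGER, name TEXT);
    INSERT INTO word VALUES (1, 10), (2, 20), (3, 10);
    INSERT INTO language VALUES (10, 'Ceq Wong'), (20, 'Seychelles Creole');
""")

# explicit INNER JOIN syntax ...
explicit = conn.execute(
    "SELECT DISTINCT l.name FROM word AS w "
    "INNER JOIN language AS l ON w.language_id = l.id "
    "ORDER BY l.name").fetchall()

# ... versus the implicit comma-join with a WHERE clause
implicit = conn.execute(
    "SELECT DISTINCT l.name FROM word AS w, language AS l "
    "WHERE w.language_id = l.id ORDER BY l.name").fetchall()

assert explicit == implicit  # same rows; the query plans may still differ
```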

Running multiple squid instances on Ubuntu

December 3, 2008 at 7:00 pm | Posted in http | 3 Comments

As described in earlier posts, our standard web application setup at work is TurboGears behind squid as a transparent caching proxy behind Apache. One of the reasons for this setup is that we want fine-grained control over the services.

Since we already decided to run each application in its own application server, we want to keep things separate in the squid layer as well. Which brings us to the challenge of running multiple squid instances. It turns out this isn’t too hard to do, reusing most of the standard squid installation on Ubuntu.

  1. Create a link to have the squid daemon available under a different name:
    cd /usr/local/sbin
    ln /usr/sbin/squid squid2
  2. Create directories for logs and cache files:
    mkdir /var/log/squid2
    chown proxy:proxy /var/log/squid2
    mkdir /var/spool/squid2
    chown proxy:proxy /var/spool/squid2
  3. Create the configuration in /etc/squid/squid2.conf, specifying pid_filename, cache_dir and access_log in particular.
  4. Create an init script. We started out with the one installed with the package and only had to apply the following changes:
    # diff /etc/init.d/squid2 /etc/init.d/squid
    < NAME=squid2
    < CONF=/etc/squid/$NAME.conf
    < DAEMON=/usr/local/sbin/$NAME
    > NAME=squid
    > DAEMON=/usr/sbin/squid
    < SQUID_ARGS="-D -sYC -f $CONF"
    > SQUID_ARGS="-D -sYC"
    < sq=$CONF
    > sq=/etc/squid/$NAME.conf
    < $DAEMON -z -f $CONF
    > $DAEMON -z
  5. And install the init script running
    update-rc.d squid2 defaults 99

LinkedData with TurboGears and squid

December 3, 2008 at 6:33 pm | Posted in http, programming, python | 1 Comment

At work we run several TurboGears web applications, deployed behind Apache acting as proxy. I like this setup a lot, because it allows us to reuse our log file analysis infrastructure for the web applications as well.

Some of the web applications serve data that rarely changes, so to lower the traffic for TurboGears, I decided to use a transparent caching proxy. Since logging is already taken care of by Apache, I don’t mind that not all requests hit the application server.

We settled on putting a squid cache between Apache and TurboGears, which worked well after some fiddling.

Recently a new requirement came up: serving Linked Data with these web applications. This is actually pretty easy to do with TurboGears. Creating RDF+XML with the templating engines works well, and even the content negotiation mechanism recommended for serving the data is well supported. To have TurboGears serve the RDF+XML representation of a resource, just decorate the corresponding controller as follows:
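Something along the following lines – a sketch from memory, assuming TurboGears 1.x; the template names and get_resource are hypothetical:

```python
# Stacked @expose decorators: TurboGears dispatches to the second
# template when the client's Accept header asks for application/rdf+xml.
@expose(template="myapp.templates.resource")
@expose(template="myapp.templates.resource_rdf",
        content_type="application/rdf+xml",
        accept_format="application/rdf+xml",
        as_format="rdf")
def resource(self, id):
    return dict(resource=get_resource(id))  # get_resource is hypothetical
```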


TurboGears will pick the appropriate template based on the Accept header sent by the client.

Unfortunately this setup – different pages served for the same URL – doesn’t work well with our squid cache. But the Vary HTTP header comes to our rescue: to tell squid that certain request headers have to be taken into account when caching a response, send back a Vary header listing them; squid will then cache the page under a key combining the URL and those headers.

Now the only header important for our content negotiation requirement is the Accept header, so putting the following line in the controller does the trick:

response.header['Vary'] = 'accept'

Claim your online persona

November 11, 2008 at 9:43 am | Posted in Uncategorized | Leave a comment

Andy Powell has an interesting hands-on post on the topic of claiming your online persona:
define:digital identity. What I especially like: he even gives recommendations!

wlan with WPA on ThinkPad T61p with Ubuntu 7.10

November 11, 2008 at 8:24 am | Posted in Uncategorized | Leave a comment

For some reason, I never got WLAN with WPA to work seamlessly on my ThinkPad T61p with Ubuntu 7.10. So here’s a short writeup of what it takes.

  1. Stop the network
  2. Put the WPA network information in /etc/wpa_supplicant.conf (see man wpa_supplicant.conf for examples)
  3. Start the wpa supplicant running wpa_supplicant -Dwext -iwlan0 -c/etc/wpa_supplicant.conf
  4. Start the DHCP client running dhclient wlan0

Creating portable picture galleries from flickr photos

November 8, 2008 at 4:09 pm | Posted in Uncategorized | Leave a comment

I have been a flickr user for a couple of years now, and my infrastructure for creating backups of the data stored at flickr has grown more and more sophisticated. So I finally decided to polish it so that others can use it as well, and put it on Google Code.

