Vary: Accept and caching

January 26, 2010 at 6:58 pm | Posted in apache, http, squid | Leave a comment

Ok, so the kind of content negotiation necessary to serve linked data does not prevent using squid as a transparent cache proxy: Vary: Accept to the rescue!

Unfortunately it turns out that there are quite a few different Accept headers in the wild, although probably an order of magnitude fewer than User-Agent headers. On a medium-traffic site this reduces the effect of caching almost to a server-side replica of each single machine’s browser cache. (In fact, considering that a Vary: Accept header prevents IE6 from caching anything, in some cases it merely recreates the browser cache, with added latency.)

So in a situation where the only content negotiation our application actually does is deciding whether to serve application/rdf+xml or do the default thing, what can we do? Obviously, reducing the number of different Accept headers that reach squid could help. It turns out that this is pretty easy to accomplish, provided an Apache httpd is running in front of squid:

SetEnvIf Accept ".*application\/rdf\+xml.*" RDF_XML_request
RequestHeader unset Accept env=!RDF_XML_request

Whenever the Accept header contains application/rdf+xml, we leave it intact (hoping that the number of different Accept headers fulfilling this condition will be rather small); otherwise we simply remove it, forcing the upstream application to fall back to the default behaviour.
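For context, here is a minimal sketch of how this could sit in the vhost that proxies to squid (the ServerName and the squid port are assumptions, not taken from our actual setup; mod_setenvif, mod_headers, mod_proxy and mod_proxy_http need to be enabled):

<VirtualHost *:80>
    ServerName linkeddata.example.org

    # collapse all non-RDF Accept headers into one cacheable variant
    SetEnvIf Accept ".*application\/rdf\+xml.*" RDF_XML_request
    RequestHeader unset Accept env=!RDF_XML_request

    # hand everything to the squid accelerator (port is an assumption)
    ProxyPass / http://127.0.0.1:8888/
    ProxyPassReverse / http://127.0.0.1:8888/
</VirtualHost>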

Of course as soon as the upstream application does more in terms of content negotiation than simply checking whether application/rdf+xml is the best match for a request, this scheme breaks down. But are there any other problems?

Update: The above configuration works for Apache httpd 2.2.x, for 2.0.x I was successful with

SetEnvIf Accept ".*application\/rdf\+xml.*" RDF_XML_request
RequestHeader set Accept "*/*" env=!RDF_XML_request

Maximum object size for squid

July 3, 2009 at 5:35 pm | Posted in http, squid | Leave a comment

Note to self: the default maximum_object_size setting for squid is a (to me at least) non-intuitively small 4 MB.
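If larger responses should be cached, raising the limit in squid.conf is a one-liner (64 MB is just an example value):

maximum_object_size 64 MB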

Running multiple squid instances on Ubuntu

December 3, 2008 at 7:00 pm | Posted in http | 3 Comments

As described in earlier posts, our standard web application setup at work is TurboGears behind squid as a transparent caching proxy behind Apache. One of the reasons for this setup is that we want fine-grained control over the services.

Since we already decided to run each application in its own application server, we want to keep things separate in the squid layer as well. Which brings us to the challenge of running multiple squid instances. It turns out this isn’t too hard to do, reusing most of the standard squid installation on Ubuntu.

  1. Create a link to have the squid daemon available under a different name:
    cd /usr/local/sbin
    ln /usr/sbin/squid squid2
  2. Create directories for logs and cache files:
    mkdir /var/log/squid2
    chown proxy:proxy /var/log/squid2
    mkdir /var/spool/squid2
    chown proxy:proxy /var/spool/squid2
  3. Create the configuration in /etc/squid/squid2.conf, specifying pid_filename, cache_dir and access_log in particular (see the sketch after this list).
  4. Create an init script. We started out with the one installed with the package and only had to apply the following changes:
    # diff /etc/init.d/squid2 /etc/init.d/squid
    8,10c8,9
    < NAME=squid2
    < CONF=/etc/squid/$NAME.conf
    < DAEMON=/usr/local/sbin/$NAME
    ---
    > NAME=squid
    > DAEMON=/usr/sbin/squid
    13c12
    < SQUID_ARGS="-D -sYC -f $CONF"
    ---
    > SQUID_ARGS="-D -sYC"
    38c37
    < sq=$CONF
    ---
    > sq=/etc/squid/$NAME.conf
    82c81
    < $DAEMON -z -f $CONF
    ---
    > $DAEMON -z
  5. And install the init script by running
    update-rc.d squid2 defaults 99
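For step 3, a minimal /etc/squid/squid2.conf could look roughly like this (the ports, defaultsite and cache size are assumptions; the remaining settings can be copied from the stock squid.conf):

http_port 127.0.0.1:8889 accel defaultsite=second.app.example
cache_peer 127.0.0.1 parent 8081 0 no-query originserver
pid_filename /var/run/squid2.pid
cache_dir ufs /var/spool/squid2 1024 16 256
access_log /var/log/squid2/access.log squid
cache_log /var/log/squid2/cache.log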

LinkedData with TurboGears and squid

December 3, 2008 at 6:33 pm | Posted in http, programming, python | 1 Comment

At work we run several TurboGears web applications, deployed behind Apache acting as proxy. I like this setup a lot, because it allows us to reuse our log file analysis infrastructure for the web applications as well.

Some of the web applications serve data that rarely changes, so to lower the traffic for TurboGears, I decided to use a transparent cache proxy. Since logging is already taken care of by Apache, I don’t mind that not all requests hit the application server.

We settled on putting a squid cache between Apache and TurboGears, which worked well after some fiddling.

Recently a new requirement came up: serving Linked Data with these web applications. This is actually pretty easy to do with TurboGears. Creating RDF/XML with the templating engines works well, and even the content negotiation mechanism recommended for serving the data is well supported. To have TurboGears serve the RDF/XML representation of a resource, just decorate the corresponding controller as follows:

@expose(as_format="rdf",
        format="xml",
        template="...",
        content_type="application/rdf+xml",
        accept_format="application/rdf+xml")

TurboGears will pick the appropriate template based on the Accept header sent by the client.

Unfortunately this setup – different pages served for the same URL – doesn’t work well with our squid cache. But the Vary HTTP header comes to our rescue: to tell squid that certain request headers have to be taken into account when caching a response, send back a Vary header listing those headers. Squid will then key each cached page on the combination of URL and the listed headers.
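A cacheable response for the RDF/XML variant then carries something along these lines:

HTTP/1.1 200 OK
Content-Type: application/rdf+xml
Vary: Accept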

Now the only header important for our content negotiation requirement is the Accept header, so putting the following line in the controller does the trick:

response.header['Vary'] = 'accept'
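Putting both pieces together, a controller method might look roughly like this (class, template and helper names are made up, and depending on the TurboGears version the response headers may be reached differently, e.g. via cherrypy.response):

import cherrypy
from turbogears import controllers, expose

class FeatureController(controllers.Controller):

    @expose(template="myapp.templates.feature_html")
    @expose(as_format="rdf",
            format="xml",
            template="myapp.templates.feature_rdf",
            content_type="application/rdf+xml",
            accept_format="application/rdf+xml")
    def index(self, id):
        # make squid cache one variant per Accept header
        cherrypy.response.headers['Vary'] = 'Accept'
        return dict(feature=load_feature(id))  # load_feature() is a hypothetical helper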

Squid as accelerator for TurboGears

April 16, 2008 at 5:49 am | Posted in http, programming | 3 Comments

I’m trying to use squid as a transparent caching proxy between Apache httpd and a largely read-only TurboGears web application.

Apache already acts as a proxy, sending requests to the web app, which listens on a different port. But using mod_cache was not an option: for Apache 2.0 it is still experimental, and it doesn’t seem to work well with mod_rewrite either.

So the idea was to simply plug in squid.

The main problem so far was to narrow down the monumental squid configuration to the few lines I actually need. This is what I came up with:


# listen on an internal port, in accelerator mode
http_port 127.0.0.1:8888 accel defaultsite=site.served.by.web.app
# the web app is the only peer, and it is the origin server (no ICP)
cache_peer 127.0.0.1 parent 8080 0 no-query originserver
# regard everything as fresh for a day, whatever client and app say
refresh_pattern ^http: 1440 20% 1440 override-expire override-lastmod ignore-no-cache ignore-reload ignore-private
# access control is wide open; squid only listens on an internal port anyway
acl all src 0.0.0.0/0.0.0.0
acl our_sites dstdomain 127.0.0.1
http_access allow all
http_access allow our_sites

The http_port directive tells squid to listen on port 8888, in accelerator mode, proxying the default site.

The cache_peer directive specifies 127.0.0.1:8080 – i.e. the web app – as the only cache peer. So whenever squid cannot serve a request from its cache, this is where it turns for help. The last three tokens 0 no-query originserver say that this peer is the origin server rather than another squid: the ICP port is set to 0 and ICP queries are disabled.

The refresh_pattern directive specifies the rules according to which an item in the cache is treated as fresh or stale. In this case, all items with an http URL are regarded as fresh for one day (1440 minutes). The options override-expire override-lastmod ignore-no-cache ignore-reload ignore-private override whatever the client or the web app says about caching – so this setup is NOT an HTTP-compliant cache. But that’s alright, since we only cache stuff we produce ourselves, so we should know.

I didn’t spend much time investigating the access control settings, since I figure my setup – squid only listening on an internal port – already does away with most security concerns.

So this is what the results look like in squid’s access log:


1208325630.099 688 127.0.0.1 TCP_MISS/200 11103 GET http://localhost/feature/28 - FIRST_UP_PARENT/127.0.0.1 text/html
1208325634.274 1 127.0.0.1 TCP_HIT/200 11109 GET http://localhost/feature/28 - NONE/- text/html

The second token is the number of milliseconds squid needed to process the request – 688 ms for the initial cache miss, just 1 ms for the subsequent hit.
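A quick way to reproduce this is to request the same URL twice through squid and check the log (standard curl options; ports and paths as in this setup):

curl -s -o /dev/null http://127.0.0.1:8888/feature/28
curl -s -o /dev/null http://127.0.0.1:8888/feature/28
tail -n 2 /var/log/squid/access.log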

mod_proxy

January 9, 2008 at 10:32 am | Posted in apache, http | Leave a comment

After fiddling around trying to get a proxy rewrite rule to work yet again, it was time for a blog post.

On Ubuntu 6.10, things were easy:

a2enmod proxy

and everything is fine.

On SuSE SLES 10, things were harder: load proxy_http explicitly, too, and in the correct order!
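From memory, the crucial bit looks something like this in the server config (the module paths are an assumption for SLES):

# order matters: mod_proxy must be loaded before mod_proxy_http
LoadModule proxy_module /usr/lib/apache2/mod_proxy.so
LoadModule proxy_http_module /usr/lib/apache2/mod_proxy_http.so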

On Ubuntu 7.10:

a2enmod proxy

didn’t cut it. After some tearing out of hair, I remembered this proxy_http thing.

a2enmod proxy_http

informs me that proxy_http is already enabled as a dependency of proxy. It still adds a new link to proxy_http.load in mods-enabled. Apparently the line loading proxy_http has moved from proxy.load into its own load config.

The Power of CGI

September 19, 2007 at 7:50 am | Posted in cgi, http, python | 4 Comments

Today, with all these web application frameworks around, CGI has almost become obsolete. At least in my toolbox, it’s slid to the bottom. And whenever I stumble upon it, I have to look up the spec.

Recently a colleague of mine had the following problem: he wanted to hand out cookies to passers-by, i.e. redirect requests while making sure the user agents have a cookie when requesting the new location.

First idea: a job for Apache’s mod_rewrite. But there’s no way to add a Set-Cookie header with mod_rewrite alone. So mod_headers should do the trick – except that the mod_headers directives are never evaluated, because mod_rewrite has already returned the redirect response.

So CGI to the rescue. But how do you set a particular response status or trigger a redirect via CGI? That’s what the spec says:

Parsed headers

The output of scripts begins with a small header. This header consists of text lines, in the same format as an HTTP header, terminated by a blank line (a line with only a linefeed or CR/LF). Any headers which are not server directives are sent directly back to the client. Currently, this specification defines three server directives:

  • Content-type This is the MIME type of the document you are returning.
  • Location This is used to specify to the server that you are returning a reference to a document rather than an actual document. If the argument to this is a URL, the server will issue a redirect to the client. If the argument to this is a virtual path, the server will retrieve the document specified as if the client had requested that document originally. ? directives will work in here, but # directives must be redirected back to the client.
  • Status This is used to give the server an HTTP/1.0 status line to send to the client. The format is nnn xxxxx, where nnn is the 3-digit status code, and xxxxx is the reason string, such as “Forbidden”.

Voila. Something like this does the trick:

print "Status: 302 Found"
print "Location: /"
print "Set-Cookie: key=value; path=/; expires=Wednesday, 09-Nov-07 23:12:40"
print
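To hook the script up, map the old location onto it in the Apache configuration (the location and path are made up; mod_cgi needs to be enabled):

ScriptAlias /old-location /usr/lib/cgi-bin/cookie_redirect.py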
