Tibor's Musings

Accelerating Web Sites with Varnish

Varnish is a popular HTTP accelerator that can speed up web sites. Here is an example of how to set it up on an SLC6 box in order to test the responsiveness of the CERN Open Data portal.

Installation

Official Varnish packages for Scientific Linux 6 (a distribution that is binary compatible with CentOS 6 and RHEL 6) are outdated. To install the latest Varnish version 4, one can use Varnish's own package repository:

sudo rpm --nosignature -i https://repo.varnish-cache.org/redhat/varnish-4.0.el6.rpm
sudo yum install -y varnish

Configuration

Let us configure Varnish so that it listens on port 80 and forwards traffic to our web application, which runs on port 8080 on the same machine.

First, let's configure the Varnish listening port:

sudo perl -pi -e 's,VARNISH_LISTEN_PORT=6081,VARNISH_LISTEN_PORT=80,g' /etc/sysconfig/varnish

and make it use more memory while we are at it:

sudo perl -pi -e 's,VARNISH_STORAGE_SIZE=256M,VARNISH_STORAGE_SIZE=512M,g' /etc/sysconfig/varnish
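
Note that the two substitutions above silently do nothing if the file does not contain exactly the expected default value. Here is a minimal sketch of a more defensive variant. The helper name set_sysconfig_option is hypothetical, and the sketch assumes that options appear as KEY=value at the start of a line and that the value does not contain a "|" character:

```shell
# Hypothetical helper: set KEY=VALUE in a sysconfig-style file,
# whatever the previous value was, and fail if KEY is not present.
set_sysconfig_option() {
    file=$1 key=$2 value=$3
    # Refuse to continue when the option is missing, instead of silently doing nothing.
    grep -q "^${key}=" "$file" || { echo "no ${key} in ${file}" >&2; return 1; }
    # Rewrite the whole line; -i.bak keeps a backup copy next to the file.
    sed -i.bak "s|^${key}=.*|${key}=${value}|" "$file"
}

# Demonstration on a scratch file rather than on the live configuration.
conf=$(mktemp)
printf 'VARNISH_LISTEN_PORT=6081\nVARNISH_STORAGE_SIZE=256M\n' > "$conf"
set_sysconfig_option "$conf" VARNISH_LISTEN_PORT 80
set_sysconfig_option "$conf" VARNISH_STORAGE_SIZE 512M
cat "$conf"
```

On the real /etc/sysconfig/varnish file the helper would have to be run as root.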

The web application we are trying to accelerate, in this case the CERN Open Data test instance, runs on top of Apache. Let's make it listen on port 8080 only and on the incoming IP address 127.0.0.1 only:

sudo perl -pi -e 's,Listen 80,Listen 8080,g' /etc/httpd/conf/httpd.conf
sudo -u apache perl -pi -e 's,128.142.151.32:80,127.0.0.1:8080,g' /opt/open-data/.virtualenvs/opendata/var/invenio.base-instance/apache/invenio-apache-vhost.conf

After restarting Apache and Varnish:

sudo /etc/init.d/httpd restart
sudo /etc/init.d/varnish restart

we can check that the processes are listening where they should:

$ sudo netstat -lp | grep varnish
tcp        0      0 *:http                      *:*                         LISTEN      50690/varnishd
tcp        0      0 localhost:6082              *:*                         LISTEN      50685/varnishd
tcp        0      0 *:http                      *:*                         LISTEN      50690/varnishd
$ sudo netstat -lp | grep httpd
tcp        0      0 *:webcache                  *:*                         LISTEN      50592/httpd
unix  2      [ ACC ]     STREAM     LISTENING     5289212 50592/httpd         /opt/open-data/.virtualenvs/opendata/var/run.50592.0.1.sock

and that a web client connecting from a laptop sees what it should:

$ curl -I http://opendata.cern.ch/
HTTP/1.1 200 OK
Date: Thu, 18 Sep 2014 19:35:08 GMT
Server: Apache
Content-Type: text/html; charset=utf-8
X-Varnish: 98611 3
Age: 99
Via: 1.1 varnish-v4
Content-Length: 8934
Connection: keep-alive
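
In Varnish 4, an X-Varnish header carrying two transaction IDs (as in "98611 3" above) indicates a cache hit, while a single ID indicates a miss; the Age header tells how long the object has been in the cache. Here is a minimal sketch that classifies a response accordingly. The helper name varnish_cache_status is hypothetical:

```shell
# Hypothetical helper: read HTTP response headers on stdin and print
# "hit", "miss", or "no-varnish" based on the X-Varnish header.
varnish_cache_status() {
    # Strip carriage returns so that the last field is a clean number.
    tr -d '\r' | awk '
        tolower($1) == "x-varnish:" { seen = 1; ids = NF - 1 }
        END {
            if (!seen)        print "no-varnish"
            else if (ids > 1) print "hit"
            else              print "miss"
        }'
}

# Demonstration with a subset of the headers shown above.
printf 'HTTP/1.1 200 OK\r\nX-Varnish: 98611 3\r\nAge: 99\r\n\r\n' | varnish_cache_status
```

Against a live site one could run: curl -sI http://opendata.cern.ch/ | varnish_cache_status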

However, because of the Varnish proxy, the Apache log sees all incoming requests as coming from 127.0.0.1:

$ tail /opt/open-data/.virtualenvs/opendata/var/log/apache.log
127.0.0.1 - - [18/Sep/2014:21:24:15 +0200] "GET /gen/almond.js?5127e506 HTTP/1.1" 304 - "http://opendata.cern.ch/" "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.1.0" 660
127.0.0.1 - - [18/Sep/2014:21:24:15 +0200] "GET /gen/invenio.js?8e21d7fc HTTP/1.1" 304 - "http://opendata.cern.ch/" "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.1.0" 606
127.0.0.1 - - [18/Sep/2014:21:24:15 +0200] "GET /gen/jquery.js?a6392293 HTTP/1.1" 304 - "http://opendata.cern.ch/" "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.1.0" 1019

Let's fix it.

mod_rpaf

The mod_rpaf ("reverse proxy add forward") module can help us here. However, it is not available for RHEL6 out of the box.

One could take it from a package built for CentOS 6, a binary compatible distribution:

sudo rpm -ivh ftp://ftp.pbone.net/mirror/ftp5.gwdg.de/pub/opensuse/repositories/Apache:/Modules/CentOS_CentOS-6/x86_64/mod_rpaf-0.6-1.2.x86_64.rpm
for configoption in "LoadModule rpaf_module modules/mod_rpaf-2.0.so" \
        "RPAFenable On" \
        "RPAFsethostname On" \
        "RPAFproxy_ips 127.0.0.1 ::1" \
        "RPAFheader X-Forwarded-For"; do
    if ! grep -q "${configoption}" /etc/httpd/conf.d/mod_rpaf.conf; then
        echo "${configoption}" | sudo tee -a /etc/httpd/conf.d/mod_rpaf.conf
    fi
done

We can also easily compile it ourselves:

cd /tmp
wget http://www.stderr.net/apache/rpaf/download/mod_rpaf-0.6.tar.gz
tar xvfz mod_rpaf-0.6.tar.gz
cd mod_rpaf-0.6
sudo yum install -y httpd-devel
sudo apxs -i -c -n mod_rpaf-2.0.so mod_rpaf-2.0.c
# gives /usr/lib64/httpd/modules/mod_rpaf-2.0.so

Once available, let's configure mod_rpaf as follows:

$ sudo vim /etc/httpd/conf.d/mod_rpaf.conf
$ cat /etc/httpd/conf.d/mod_rpaf.conf
LoadModule rpaf_module modules/mod_rpaf-2.0.so
RPAFenable On
RPAFsethostname On
RPAFproxy_ips 127.0.0.1 ::1
RPAFheader X-Forwarded-For

After restarting Apache, we see real IP addresses in the Apache log:

86.209.237.81 - - [18/Sep/2014:21:59:27 +0200] "POST /results/83ce2e1d87cb0b8a190d34e69cba4786 HTTP/1.1" 200 43145 "http://opendata.cern.ch/search" "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.1.0" 490552
86.209.237.81 - - [18/Sep/2014:21:59:27 +0200] "GET /facet/collection/83ce2e1d87cb0b8a190d34e69cba4786?parent=CMS-Derived-Datasets HTTP/1.1" 200 13 "http://opendata.cern.ch/search" "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.1.0" 12762
86.209.237.81 - - [18/Sep/2014:21:59:27 +0200] "POST /results/83ce2e1d87cb0b8a190d34e69cba4786 HTTP/1.1" 200 43145 "http://opendata.cern.ch/search" "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.1.0" 968808
86.209.237.81 - - [18/Sep/2014:21:59:27 +0200] "POST /results/83ce2e1d87cb0b8a190d34e69cba4786 HTTP/1.1" 200 24569 "http://opendata.cern.ch/search" "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.1.0" 957070
86.209.237.81 - - [18/Sep/2014:21:59:27 +0200] "POST /results/83ce2e1d87cb0b8a190d34e69cba4786 HTTP/1.1" 200 24569 "http://opendata.cern.ch/search" "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.1.0" 4294376
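
To confirm on a larger log that requests are no longer all attributed to 127.0.0.1, one can count requests per client address. Here is a minimal sketch, assuming the client address is the first field of each log line. The helper name requests_per_client is hypothetical, and the sample lines are abbreviated from the excerpts above:

```shell
# Hypothetical helper: count requests per client address in an access log
# read from stdin, busiest client first.
requests_per_client() {
    awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Demonstration with three abbreviated log lines.
printf '%s\n' \
    '127.0.0.1 - - [18/Sep/2014:21:24:15 +0200] "GET /gen/almond.js?5127e506 HTTP/1.1" 304 -' \
    '86.209.237.81 - - [18/Sep/2014:21:59:27 +0200] "POST /results/83ce2e1d87cb0b8a190d34e69cba4786 HTTP/1.1" 200 43145' \
    '86.209.237.81 - - [18/Sep/2014:21:59:27 +0200] "GET /facet/collection/83ce2e1d87cb0b8a190d34e69cba4786?parent=CMS-Derived-Datasets HTTP/1.1" 200 13' \
    | requests_per_client
```

On the server one would feed it the real log, e.g. requests_per_client < /opt/open-data/.virtualenvs/opendata/var/log/apache.log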

All is well; the basic configuration is finished.

Massaging cookies

To profit from the web acceleration, we can configure Varnish to cache everything coming from the backend application except for the search pages. Let's do this regardless of what our backend application says, because our application does not offer any user-specific functionality that would differentiate one guest user from another. Every user is treated equally; there is no login, no restricted data, etc.

The Invenio backend application currently does not handle Set-Cookie very nicely; see issue #2291. Let us assume therefore that we need to remove this header from all application responses, except for search pages.

(Actually, we could also cache the search pages, but let's assume that we would like to amend cookies coming from the backend application only for certain URLs. This "advanced" configuration may be needed later in production.)

Let's start by cloning the default configuration:

sudo cp /etc/varnish/default.vcl /etc/varnish/opendata.vcl
sudo vim /etc/varnish/opendata.vcl

Let's introduce the following differences, where we basically unset req.http.Cookie and beresp.http.set-cookie for all pages except those whose URL starts with /search, and cache those pages for one hour:

sudo diff -u /etc/varnish/default.vcl /etc/varnish/opendata.vcl
--- /etc/varnish/default.vcl    2014-06-24 11:40:31.000000000 +0200
+++ /etc/varnish/opendata.vcl   2014-09-18 21:30:20.940161421 +0200
@@ -23,6 +23,10 @@
     #
     # Typically you clean up the request here, removing cookies you don't need,
     # rewriting the request, etc.
+
+    if (!(req.url ~ "^/search")) {
+        unset req.http.Cookie;
+    }
 }

 sub vcl_backend_response {
@@ -30,6 +34,11 @@
     #
     # Here you clean the response headers, removing silly Set-Cookie headers
     # and other mistakes your backend does.
+
+    if (!(bereq.url ~ "^/search")) {
+        unset beresp.http.set-cookie;
+        set beresp.ttl = 1h;
+    }
 }

 sub vcl_deliver {

We can activate the new configuration like this:

sudo perl -pi -e 's,VARNISH_VCL_CONF=/etc/varnish/default.vcl,VARNISH_VCL_CONF=/etc/varnish/opendata.vcl,g' /etc/sysconfig/varnish
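
Once Varnish has been restarted with opendata.vcl, the effect of the cookie rules could be checked from the command line. Here is a minimal sketch. The helper name has_set_cookie is hypothetical, and the cookie value in the demonstration is made up:

```shell
# Hypothetical helper: succeed if the response headers on stdin
# contain a Set-Cookie header (matched case-insensitively).
has_set_cookie() {
    grep -qi '^set-cookie:'
}

# Demonstration with canned headers.
printf 'HTTP/1.1 200 OK\r\nSet-Cookie: session=abc; Path=/\r\n\r\n' | has_set_cookie \
    && echo "cookie passed through"
printf 'HTTP/1.1 200 OK\r\nAge: 99\r\n\r\n' | has_set_cookie \
    || echo "cookie stripped"
```

With the rules above, one would expect curl -sI http://opendata.cern.ch/ | has_set_cookie to fail, while the same check on a /search URL may still succeed if the backend sets a cookie there.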

Performance measurements

Let's measure the response time speed-up with the Apache ab tool. The old configuration gives:

laptop> ab -n 100 -c 5 http://opendata.cern.ch/
Requests per second:    32.38 [#/sec] (mean)

Restarting varnish with the new configuration gives:

$ sudo service varnish restart
$ ab -n 100 -c 5 http://opendata.cern.ch/
Requests per second:    52.36 [#/sec] (mean)

We can serve 52 reqs/sec vs 32 reqs/sec. This does not seem like much of an increase, but the measurement was done over a slow ADSL line, which limits the throughput.

Here is a throughput comparison on the server itself:

$ ab -n 100 -c 5 http://127.0.0.1:80/

Total transferred:      946000 bytes
HTML transferred:       923100 bytes
Requests per second:    2156.52 [#/sec] (mean)
Time per request:       2.319 [ms] (mean)
Time per request:       0.464 [ms] (mean, across all concurrent requests)
Transfer rate:          19922.54 [Kbytes/sec] received

$ ab -n 100 -c 5 http://127.0.0.1:8080/

Total transferred:      931641 bytes
HTML transferred:       909141 bytes
Requests per second:    72.28 [#/sec] (mean)
Time per request:       69.175 [ms] (mean)
Time per request:       13.835 [ms] (mean, across all concurrent requests)
Transfer rate:          657.61 [Kbytes/sec] received

We are much, much faster: about 2.1k reqs/sec vs 72 reqs/sec, i.e. roughly 30 times more.
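
The ratio between the two local runs can be worked out directly from the ab output. Here is a minimal sketch, using the two "Requests per second" lines from the runs above. The helper name ab_rps is hypothetical:

```shell
# Hypothetical helper: extract the "Requests per second" figure from ab output on stdin.
ab_rps() {
    awk '/^Requests per second:/ { print $4 }'
}

# Figures taken from the two local runs above (port 80 via Varnish, port 8080 direct).
cached=$(printf 'Requests per second:    2156.52 [#/sec] (mean)\n' | ab_rps)
direct=$(printf 'Requests per second:    72.28 [#/sec] (mean)\n' | ab_rps)

# Shell arithmetic is integer-only, so let awk do the division.
awk -v a="$cached" -v b="$direct" 'BEGIN { printf "speed-up: %.1fx\n", a / b }'
```

This prints "speed-up: 29.8x". In practice one would pipe the live output, e.g. ab -n 100 -c 5 http://127.0.0.1:80/ | ab_rps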

Slashdot effect

Let's try to increase the number of client connections and observe response times when simulating 5 and 100 concurrent users:

laptop> ab -n 100 -c 5 http://opendata.cern.ch/
Requests per second:    52.36 [#/sec] (mean)

laptop> ab -n 1000 -c 100 http://opendata.cern.ch/
Requests per second:    57.78 [#/sec] (mean)

The cache can easily serve such increased traffic, because the pages are served from memory via an efficient event-driven model.

Note that a proper user scalability test would require distributed testing with some backend heat processes, e.g. via siege. However, we are interested here in a rule of thumb only.

Reboot-persistent configuration

How to make Varnish run after reboot:

$ sudo chkconfig | grep http
httpd           0:off   1:off   2:on    3:on    4:on    5:on    6:off
$ sudo chkconfig | grep varnish
varnish         0:off   1:off   2:off   3:off   4:off   5:off   6:off
varnishlog      0:off   1:off   2:off   3:off   4:off   5:off   6:off
varnishncsa     0:off   1:off   2:off   3:off   4:off   5:off   6:off
$ sudo chkconfig varnish on
$ sudo chkconfig | grep varnish
varnish         0:off   1:off   2:on    3:on    4:on    5:on    6:off
varnishlog      0:off   1:off   2:off   3:off   4:off   5:off   6:off
varnishncsa     0:off   1:off   2:off   3:off   4:off   5:off   6:off
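
The same check can be scripted, e.g. as part of a deployment recipe. Here is a minimal sketch that reads chkconfig --list output. The helper name enabled_in_runlevel is hypothetical:

```shell
# Hypothetical helper: given `chkconfig --list` output on stdin, succeed if
# the named service is "on" in the given runlevel.
enabled_in_runlevel() {
    service=$1 level=$2
    awk -v s="$service" -v l="$level" '
        $1 == s { for (i = 2; i <= NF; i++) if ($i == (l ":on")) found = 1 }
        END { exit !found }'
}

# Demonstration with the line shown above.
printf 'varnish         0:off   1:off   2:on    3:on    4:on    5:on    6:off\n' \
    | enabled_in_runlevel varnish 3 && echo "varnish starts in runlevel 3"
```

On the server one would run: chkconfig --list | enabled_in_runlevel varnish 3. Note that varnishlog and varnishncsa stay off in the listing above, so they would not be started after a reboot unless they are enabled in the same way.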

Nicer error page

By default the Varnish error page is not very user-friendly. E.g. stop Apache and observe "Error 503 Backend fetch failed".

To make the error page simpler and to hide the Varnish server signature, we can edit vcl_backend_error:

$ sudo vim /etc/varnish/opendata.vcl
$ sudo diff -u /etc/varnish/default.vcl /etc/varnish/opendata.vcl

[...]
+
+sub vcl_backend_error {
+    set beresp.http.Content-Type = "text/html; charset=utf-8";
+    set beresp.http.Retry-After = "5";
+    synthetic( {"<!DOCTYPE html>
+<html>
+  <head>
+    <title>"} + beresp.status + " " + beresp.reason + {"</title>
+  </head>
+  <body>
+    <h1>Error "} + beresp.status + " " + beresp.reason + {"</h1>
+  </body>
+</html>
+"} );
+    return(deliver);
+}

Logging

To enable logging of incoming queries on the Varnish level, do:

$ sudo /etc/init.d/varnishncsa start

$ cat /var/log/varnish/varnishncsa.log
128.141.95.173 - - [05/Nov/2014:13:09:15 +0100] "GET http://opendata.cern.ch/ HTTP/1.1" 200 11612 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.1.0"
128.141.95.173 - - [05/Nov/2014:13:09:15 +0100] "GET http://opendata.cern.ch/gen/opendata.css?eb2f0489 HTTP/1.1" 200 0 "http://opendata.cern.ch/" "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.1.0"
128.141.95.173 - - [05/Nov/2014:13:09:15 +0100] "GET http://opendata.cern.ch/gen/invenio.css?56a680c2 HTTP/1.1" 200 0 "http://opendata.cern.ch/" "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.1.0"
128.141.164.203 - - [05/Nov/2014:13:09:39 +0100] "GET http://opendata.cern.ch/ HTTP/1.1" 200 11597 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0"

Conclusions

Varnish is a widely used HTTP accelerator for web applications. Its use for the CERN Open Data portal seems perfectly plausible. One can relatively easily configure it to amend Set-Cookie for certain pages in case the web application is buggy. The setup on the SLC6 platform seems stable under heavy load.