Friday, September 28, 2012

Web Request Rate Analysis with R

During performance testing it's easy to find raw data relating to the response time of a web service or RPC. This information is typically recorded in log files such as the server access log, or in some form in the load-generating client's logs. Response time statistics are an important operational parameter, but they need to be qualified with the request rate to give context to the metric. I use a simple R script to generate request rate graphs from Apache JMeter output (.jtl files), but the concept can easily be extended to any format that includes timestamped entries, preferably labelled to distinguish one service type from another.

The basis of the request rate calculation is computing and inverting the inter-arrival time of requests. If requests are generated independently then the request generator follows a Poisson process and inter-arrival times will follow an exponential distribution. Requests directed at a web site will generally not be entirely independent, as a user is guided by the site's navigation constraints, but the approximation lends itself to modelling the service using queueing theory. One of the simplest such models is the M/M/1 queue, which models the queue length in a system with a single server, where arrivals are determined by a Poisson process and job service times have an exponential distribution.
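For reference, the textbook steady-state results for the M/M/1 queue, with arrival rate \(\lambda\) and service rate \(\mu\), are:

```latex
\rho = \frac{\lambda}{\mu} < 1, \qquad
L = \frac{\rho}{1-\rho}, \qquad
W = \frac{1}{\mu - \lambda}
```

where \(\rho\) is the server utilization, \(L\) the expected number of requests in the system, and \(W\) the expected time a request spends in the system. The measured request rate feeds directly into \(\lambda\).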

Moving on to less esoteric endeavours: each request from the JMeter client is recorded as a separate XML element in a jtl file. Request parameters are recorded as attributes on the element; the attributes we are interested in are ts (epoch timestamp) and lb (service label). The shell command below turns the jtl file into a format that R can read into a dataframe. This includes adding column headers to save us having to define them within the R script. These shell commands are the part you would change if importing data from another source.
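A minimal sketch of such a conversion with sed — the attribute order (ts before lb), the sample element, and the file names are assumptions to adapt to your own output:

```shell
# Hypothetical httpSample element, for illustration only:
printf '<httpSample t="42" lt="40" ts="1348823000000" s="true" lb="login"/>\n' > results.jtl

# Extract ts and lb, prefixing column headers for R's read.table:
echo "ts lb" > requests.dat
sed -n 's/.*<httpSample[^>]* ts="\([0-9]*\)"[^>]* lb="\([^"]*\)".*/\1 \2/p' \
    results.jtl >> requests.dat
cat requests.dat
```

This yields one "timestamp label" row per request, with a header row read.table can pick up via header = TRUE.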

Now we run the data file generated above through the R script below. The script begins by loading the data into a dataframe and sorting based on timestamp. We can't take for granted that log entries generated in a buffered multi-threaded environment will be chronologically interleaved. We then iterate over the dataframe, calculating the inter-arrival time between each request and its predecessor. This value is added as a new column v to the dataframe. We then invert and scale the inter-arrival times and plot over time to give us the overall request rate per minute. We then filter the data on service labels and perform the same exercise to get a service-specific request rate breakdown. We have used a locally weighted smoothing function, adjusting the span factor to suit our granularity requirements.
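A sketch of such a script follows; the file name, column names, and the lowess span factor are assumptions:

```r
# Load the converted data and sort by timestamp; we can't assume
# chronological order from a buffered multi-threaded logger.
d <- read.table("requests.dat", header = TRUE)
d <- d[order(d$ts), ]
d$v <- c(NA, diff(d$ts))              # inter-arrival time in ms (column v)
d$min <- (d$ts - d$ts[1]) / 60000     # elapsed time in minutes

# Overall rate: invert and scale inter-arrival times to requests/min,
# smoothed with a locally weighted (lowess) fit.
plot(lowess(d$min[-1], 60000 / d$v[-1], f = 0.1), type = "l",
     xlab = "Elapsed minutes", ylab = "Requests/min")

# Per-service breakdown: filter on each label and repeat the exercise.
for (l in unique(d$lb)) {
  s <- d[d$lb == l, ]
  s$v <- c(NA, diff(s$ts))
  lines(lowess(s$min[-1], 60000 / s$v[-1], f = 0.1), lty = 2)
}
```

Lowering the span factor f gives a more responsive curve at the cost of more noise.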

A sample graph is shown below. For this test run we used a rate-controlled request generator with a small ramp, accounting for the initial rise and bounded oscillation for each service. Below that is a density plot of the inter-arrival time for this sample data, showing that it does indeed look exponential, as postulated by the Poisson process model.
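The density check can be reproduced directly in R (a sketch; the file and column names are assumptions):

```r
# Plot the empirical inter-arrival density with a fitted exponential overlay.
d <- read.table("requests.dat", header = TRUE)
v <- diff(sort(d$ts))                        # inter-arrival times in ms
plot(density(v), main = "Inter-arrival time density")
curve(dexp(x, rate = 1 / mean(v)), add = TRUE, lty = 2)  # fitted exponential
```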

Friday, September 21, 2012

Apache Web Stats from Access Logs

Apache access logs represent a wealth of information regarding who/what is hitting your web site/service and how you are handling the traffic. There are many tools available to digest this information and generate useful insight, but just in case you have an itch to do it yourself, here's a bit of scripting to get you on your way. I used venerable old sed(1) to pull out the fields we are interested in and R to generate bar graphs, showing the breakdown of hits by source IP address and agent headers.

The first step is to parse the access logs into a form suitable for import into R. The exact parsing pattern will depend on the Apache LogFormat definitions and which one is active for the access log. In this case we are assuming the combined format ...

apache2.conf: LogFormat "%h ....." combined

%...h:          Remote host
%...l:          Remote logname
%...u:          Remote user
%...t:          Time, in common log format
%...r:          First line of request
%...s:          Status
%...O:          Bytes sent, including headers
%...D:          Time taken to serve the request, in microseconds
%{VARNAME}e:    The contents of the environment variable VARNAME
%{VARNAME}i:    The contents of VARNAME in the request header
%{VARNAME}n:    The contents of note VARNAME from another module

Using sed we extract the remote host address, the request path, and the first word of the User-Agent header. The output format is space-separated columns: ip path agent

sed -n 's/^\(\S*\) - - \[[^]]*\] "\(GET\|HEAD\) \(\/\S*\) [^"]*" [^"]*"-" "\(\S*\) .*$/\1 \3 \4/p' access.log > access.dat
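To illustrate, running the sed command above on a hypothetical combined-format line:

```shell
# Sample combined-format record (hypothetical values):
line='203.0.113.7 - - [21/Sep/2012:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux)"'
printf '%s\n' "$line" | \
  sed -n 's/^\(\S*\) - - \[[^]]*\] "\(GET\|HEAD\) \(\/\S*\) [^"]*" [^"]*"-" "\(\S*\) .*$/\1 \3 \4/p'
# → 203.0.113.7 /index.html Mozilla/5.0
```

Note that the pattern only matches GET/HEAD requests with an empty ("-") Referer header; anything else is silently dropped.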

The next step is to parse and graph in R using the script below. The script imports the normalized log records into a dataframe and then aggregates by ip and agent to generate bar graphs of the hits, broken down by source address and User-Agent header respectively. The aggregates are ordered by count and, for presentation purposes only, truncated after the top 40 classes. Because we do not take timestamps into account, multiple log files can be concatenated in any order and processed together. Below are sample output graphs. Note that the prominent representation of the Mozilla/5.0 user agent is somewhat misleading: for simplicity's sake, the sed expression only extracts the first word of the User-Agent header, which has the effect of grouping together Googlebot, Yandexbot, TweetedTimes, Firefox, and Safari, among others.
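A sketch of such a script follows; the input file name and the top-40 cutoff are as described above, while the helper name is my own:

```r
# Import the normalized log records: one row per hit, columns ip/path/agent.
d <- read.table("access.dat", header = FALSE,
                col.names = c("ip", "path", "agent"))

# Aggregate a column into hit counts, ordered by count and truncated
# to the top n classes for presentation.
top <- function(x, n = 40) {
  head(sort(table(x), decreasing = TRUE), n)
}

barplot(top(d$ip), las = 2, cex.names = 0.6, main = "Hits by source IP")
barplot(top(d$agent), las = 2, cex.names = 0.6, main = "Hits by user agent")
```

Because the script never looks at timestamps, feeding it several concatenated access.dat files works just as well as a single one.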