## Friday, September 21, 2012

### Apache Web Stats from Access Logs

Apache access logs hold a wealth of information about who or what is hitting your web site or service and how you are handling the traffic. There are many tools available to digest this information and generate useful insight, but just in case you have an itch to do it yourself, here's a bit of scripting to get you on your way. I used venerable old sed(1) to pull out the fields we are interested in and R to generate bar graphs showing the breakdown of hits by source IP address and by User-Agent header.

The first step is to parse the access logs into a form suitable for import into R. The exact parsing pattern depends on the Apache LogFormat definitions and on which one is active for the access log. In this case we are assuming the combined format ...

apache2.conf: LogFormat "%h ....." combined

%...h:          Remote host
%...l:          Remote logname
%...u:          Remote user
%...t:          Time, in common log format time format
%...r:          First line of request
%...s:          Status.
%...O:          Bytes sent
%...D:          Response time
%{VARNAME}e:    The contents of the environment variable VARNAME.
%{VARNAME}i:    The contents of VARNAME in request header.
%{VARNAME}n:    The contents of note VARNAME from another module.

Using sed we extract the remote host address, request path, and first word of
User-Agent header. Output format is space-separated columns: ip path agent

sed -n 's/^\(\S*\) - - \[[^]]*\] "\(GET\|HEAD\) \(\/\S*\) [^"]*" [^"]*"-" "\(\S*\) .*$/\1 \3 \4/p' access.log > access.dat
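To see what the extraction produces, here is a small sketch that runs the same expression (groups written as `\(...\)`, timestamp brackets as `\[...\]`) over three made-up log entries. The sample lines, the documentation-range addresses, and the `extract_fields` wrapper are all assumptions for illustration, and `\S` and `\|` rely on GNU sed.

```shell
# Hypothetical wrapper around the sed expression; reads the named files, or stdin.
extract_fields() {
  sed -n 's/^\(\S*\) - - \[[^]]*\] "\(GET\|HEAD\) \(\/\S*\) [^"]*" [^"]*"-" "\(\S*\) .*$/\1 \3 \4/p' "$@"
}

# Made-up entries in combined format (not taken from a real log).
extract_fields <<'EOF'
192.0.2.10 - - [21/Sep/2012:10:15:32 -0400] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/15.0"
203.0.113.7 - - [21/Sep/2012:10:15:40 -0400] "POST /login HTTP/1.1" 302 0 "-" "Mozilla/5.0 (Windows NT 6.1)"
198.51.100.3 - - [21/Sep/2012:10:16:02 -0400] "HEAD /feed.xml HTTP/1.1" 200 0 "-" "curl/7.26.0 (x86_64-pc-linux-gnu) libcurl/7.26.0"
EOF
```

Only the GET and HEAD entries should come through, as `192.0.2.10 /index.html Mozilla/5.0` and `198.51.100.3 /feed.xml curl/7.26.0`. As written, the pattern also skips entries whose referer field is anything other than `"-"`, and entries whose User-Agent is a single word with no space after it.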

The next step is to parse and graph the data in R using the script below. The script imports the normalized log records into a data frame and then aggregates by ip and by agent to generate bar graphs of the hits, broken down by source address and by User-Agent header respectively. The aggregates are ordered by count and, for presentation purposes only, truncated after the top 40 classes. Because we do not take timestamps into account, multiple log files can be concatenated in any order and processed together. Below are sample output graphs.

Note that the prominent representation of the Mozilla/5.0 user agent is somewhat misleading. For simplicity's sake, the sed expression extracts only the first word of the User-Agent header, which has the effect of grouping together Googlebot, Yandexbot, TweetedTimes, Firefox, and Safari, among others.
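As a quick cross-check on the graphs, the same aggregation (count per class, ordered by count, top 40 only) can be roughed out in the shell. This is a minimal sketch and not the R script itself; `top_classes` is a hypothetical helper, and it assumes `access.dat` holds the three space-separated columns ip, path, agent.

```shell
# Hypothetical helper: count hits per distinct value of one column.
#   $1 = column number (1 = ip, 3 = agent)
#   $2 = data file with space-separated columns: ip path agent
# Prints "count value" lines, highest count first, at most 40 lines.
top_classes() {
  awk -v col="$1" '{ n[$col]++ } END { for (k in n) print n[k], k }' "$2" |
    LC_ALL=C sort -k1,1nr -k2,2 |
    head -n 40
}

# Run against access.dat only if it exists in the current directory.
if [ -f access.dat ]; then
  echo "== hits by source address =="
  top_classes 1 access.dat
  echo "== hits by agent (first word) =="
  top_classes 3 access.dat
fi
```

Since nothing here looks at timestamps either, the counts should come out the same whichever order the log files were concatenated in.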