The first step is to parse the access logs into a form suitable for import into R. The exact parsing pattern depends on the Apache LogFormat definitions and on which one is active for the access log. In this case we assume the combined format ...
apache2.conf: LogFormat "%h ....." combined
%...h: Remote host
%...l: Remote logname
%...u: Remote user
%...t: Time, in common log format time format
%...r: First line of request
%...s: Status.
%...O: Bytes sent
%...D: Response time
%{VARNAME}e: The contents of the environment variable VARNAME.
%{VARNAME}i: The contents of VARNAME in request header.
%{VARNAME}n: The contents of note VARNAME from another module.
Using sed, we extract the remote host address, the request path, and the first
word of the User-Agent header. The output is space-separated columns: ip path agent
sed -n 's/^\(\S*\) - - \[[^]]*\] "\(GET\|HEAD\) \(\/\S*\) [^"]*" [^"]*"-" "\(\S*\) .*$/\1 \3 \4/p' access.log > access.dat
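To see what the expression keeps and what it drops, it can help to run it on a single record. The sketch below assumes GNU sed (`\S` and `\|` are GNU extensions) and uses two made-up log lines; the addresses, paths, and agent strings are hypothetical, not taken from a real log. Reading the pattern, it appears to match only GET and HEAD requests whose logname and user are both "-" and whose Referer is "-", so the second record should produce no output.

```shell
# Hypothetical sample records in combined format (not from a real log).
sample='192.0.2.10 - - [10/Oct/2013:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/24.0"'
with_referer='192.0.2.11 - - [10/Oct/2013:13:55:40 +0000] "GET /about.html HTTP/1.1" 200 512 "http://example.com/" "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/24.0"'

# Same sed expression as above, reading standard input instead of access.log.
extract() {
  sed -n 's/^\(\S*\) - - \[[^]]*\] "\(GET\|HEAD\) \(\/\S*\) [^"]*" [^"]*"-" "\(\S*\) .*$/\1 \3 \4/p'
}

printf '%s\n' "$sample" | extract
# prints: 192.0.2.10 /index.html Mozilla/5.0

printf '%s\n' "$with_referer" | extract
# prints nothing: the Referer field is not "-", so the record is skipped
```

If requests that carry a Referer should be counted as well, the `"-"` part of the pattern would need to be loosened; that is a change to the expression, not something the original command does.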
The next step is to parse and graph in R using the script below. The script imports the normalized log records into a data frame and then aggregates by ip and agent to generate bar graphs of the hits, broken down by source address and User-Agent header respectively. The aggregates are ordered by count and, for presentation purposes only, truncated after the top 40 classes. Because we do not take timestamps into account, multiple log files can be concatenated in any order and processed together.
Below are sample output graphs. Note that the prominent representation of the Mozilla/5.0 user agent is somewhat misleading. For simplicity's sake, the sed expression extracts only the first word of the User-Agent header, which has the effect of grouping together Googlebot, Yandexbot, TweetedTimes, Firefox, and Safari, among others.
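As a quick sanity check on the numbers before plotting, the same aggregation can be approximated on the command line. This is not the R script, only a rough shell sketch of the counting step: the `top_hits` helper and the sample records are hypothetical, and the input is assumed to be in the "ip path agent" layout produced by the sed step.

```shell
# Hypothetical records in the "ip path agent" layout produced by the sed step.
dat=$(mktemp)
cat > "$dat" <<'EOF'
192.0.2.10 /index.html Mozilla/5.0
192.0.2.10 /about.html Mozilla/5.0
192.0.2.10 /robots.txt Mozilla/5.0
198.51.100.7 /index.html Wget/1.14
198.51.100.7 /feed Mozilla/5.0
203.0.113.5 /index.html curl/7.29.0
EOF

# top_hits COLUMN FILE: count records per distinct value of COLUMN,
# order by count (largest first), and keep at most the top 40 classes.
top_hits() {
  awk -v col="$1" '{ count[$col]++ } END { for (k in count) print count[k], k }' "$2" |
    sort -k1,1nr | head -n 40
}

top_hits 1 "$dat"   # hits by source address
top_hits 3 "$dat"   # hits by first word of the User-Agent header
```

Each output line is a count followed by the class it belongs to. Because only counts are used, running the helper on several concatenated files gives the same totals regardless of the order in which they were joined.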