Hack and / - Chopping Logs

by Kyle Rankin

If you are a sysadmin, logs can be both a bane and a boon to your existence. On a bad day, a misbehaving program could dump gigabytes of errors into its log file, fill up the disk and light up your pager like a Christmas tree. On a good day, logs show you every clue you need to track down any of a hundred strange system problems. Now, if you manage any Web servers, logs provide even more value in the form of statistics. How many visitors did you get to your main index page today? What spider is hammering your site right now?

Many excellent log-analysis tools exist. Some provide really nifty real-time visualizations of Web traffic, and others run every night and generate manager-friendly reports for you to browse. All of these programs are great, and I suggest you use them, but sometimes you need specific statistics and you need them now. For these on-the-fly statistics, I've developed a common template for a shell one-liner that chops through logs like Paul Bunyan.

What I've found is that although the specific type of information I need might change a little, the algorithm remains mostly the same. For any log file, each line contains some bit of unique information I need. Then, I need to run through the log file, identify that information and keep a running tally that increments each time I see the particular pattern. Finally, I need to output that information along with its final tally and sort based on the tally.

There are many ways you can do this type of log parsing. Old-school command-line junkies might prefer a nice sed and awk approach. The whippersnappers out there might pick a nicely formatted Python script. There's nothing at all wrong with those approaches, but I suppose I fall into the middle-child scripting category—I prefer Perl for this kind of text hacking. Maybe it's the power of Perl regular expressions, or maybe it's how easy it is to use Perl hashes, or maybe it's just what I'm most comfortable with, but I seem to be able to hack out this kind of script much faster in Perl.

Before I give a sample script though, here's a more specific algorithm. The script parses through each line of input and uses a regular expression to match a particular column or other pattern of data on the line. It then uses that pattern as a key in a hash table and increments the value of that key. When it's done accepting input, the script iterates through each key in the hash and outputs the tally for that key and the key itself.

For the test case, I use a general-purpose problem you can try yourself, as long as you have an Apache Web server. I want to find out how many unique IP addresses visited one of my sites on November 1, 2008, and the top ten IPs in terms of hits.

Here's a sample entry from the log (the IP has been changed to protect the innocent):

123.123.12.34 - - [01/Nov/2008:19:34:02 -0700] "GET /talks/pxe/ui/default/iepngfix.htc HTTP/1.1" 404 308 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; InfoPath.2)"

And, here's the one-liner that can parse the file and provide sorted output:

perl -e 'while(<>){ if( m|(^\d+\.\d+\.\d+\.\d+).*?01/Nov/2008| ){ $v{$1}++; } } foreach( keys %v ){ print "$v{$_}\t$_\n"; }' /var/log/apache/access.log | sort -n

When you run this command, you should see output something like the following, only with more lines and with IPs that aren't fake:

33      99.99.99.99
94      111.111.111.111
138     15.15.15.15

For those of you who know and love both Perl and regular expressions, that one-liner probably isn't too difficult to parse, but for the rest of you, let's go step by step. Sometimes it's easier to go through a one-liner if you see it in a formatted way, so here's the Perl part of the one-liner translated as though it were in a regular file:


#!/usr/bin/perl

# Read each line of input (from a pipe or from file arguments)
while(<>){ 
   # Capture the leading IP address, but only on entries from Nov 01, 2008
   if(m|(^\d+\.\d+\.\d+\.\d+).*?01/Nov/2008|){ 
      $v{$1}++;   # tally hits per IP in the %v hash
   } 
}

# Print each tally and its IP, tab-separated
foreach( keys %v ){ 
   print "$v{$_}\t$_\n"; 
}
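
Saved as a standalone script, say tally.pl (a name I'm inventing here), the same code can also chew through rotated, compressed logs, because it reads whatever you pipe in or pass as a file argument:

# Run against the live log passed as a file argument
perl tally.pl /var/log/apache/access.log | sort -n

# Or pipe in a rotated, compressed log (the filename is just an example)
zcat /var/log/apache/access.log.1.gz | perl tally.pl | sort -n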

First, let's discuss the while loop. Basically, while(<>) iterates over every line of input it receives either through a pipe or as a file argument on the command line. Inside this loop, I set up a regular expression to match and pull out an IP address. The regular expression is probably worth looking at in more detail:

(^\d+\.\d+\.\d+\.\d+)

This section of the regular expression matches the beginning of the line (^), then one or more digits (\d+), then a dot, another run of digits, another dot, another run of digits, another dot and finally a fourth run of digits. This pattern will match, for instance, 123.123.12.34 at the beginning of a line. I surrounded this part of the regular expression in parentheses. Because this is the first set of parentheses, when Perl matches it, it puts the resulting match into the $1 variable so I can pull it out later.
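
If you want to see the capture by itself, here's a quick throwaway test using a trimmed copy of the fake log line from above; it should print 123.123.12.34:

perl -e '$_ = q{123.123.12.34 - - [01/Nov/2008:19:34:02 -0700] "GET / HTTP/1.1" 404 308}; print "$1\n" if m|(^\d+\.\d+\.\d+\.\d+)|;'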

Now, those of you who know regular expressions know that I cheated here. This regular expression is not very explicit at all. For one, it would match completely invalid IP addresses, such as 999.999.999.999. For another, it even would match any series of four numbers with dots in between, such as 12345.6.7.8910. I chose an overly generic regular expression on purpose to make a point. There are explicit regular expressions that match only valid IP addresses, but those expressions are very long, very complex and, in this case, completely unnecessary.

Because I'm dealing with Apache logs, I am pretty confident that the first set of numbers at the beginning of each line is an IP address and not something else, and second, the IP address that Apache logged should be reasonably valid. By taking the shortcut, I not only saved on typing, but the resulting regular expression also is easier to read and understand, even if you aren't a regex wizard.
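
For the curious, a stricter pattern that limits each octet to 0 through 255 looks something like the following, wrapped in one capture group with non-capturing inner groups so $1 would still hold the whole address (and because it uses | for alternation, you'd also have to switch the one-liner's m|...| delimiters to something like m{...}). It works, but it's exactly the kind of noise I'd rather keep out of a one-liner:

(^(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})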

After I match the IP, I want to match only log entries from November 01, 2008:

.*?01/Nov/2008

This section matches any number of characters (.*), and with the question mark at the end, it matches only as much as it needs to and no more. Then, it matches the datestamp for November 01, 2008. If I wanted a tally for every day in the log file, I could omit this entire section of the regular expression. Alternatively, if I wanted to match on some other keyword (for instance, when the user performed a GET on a particular file), I could replace or augment this section with that keyword.
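
For example, to tally per-IP hits on the iepngfix.htc file from the sample entry, regardless of date, the one-liner might become something like this (substitute whatever path you actually care about):

perl -e 'while(<>){ if( m|(^\d+\.\d+\.\d+\.\d+).*?GET /talks/pxe/ui/default/iepngfix.htc| ){ $v{$1}++; } } foreach( keys %v ){ print "$v{$_}\t$_\n"; }' /var/log/apache/access.log | sort -n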

Once I have matched the IP address in a line and have assigned it to $1, I then use it as a key in a hash I call %v here and increment its value ($v{$1}++). The power of a hash is that it forces each key to be unique. That means each time I come across a new IP, it will get its own key in the hash and have its value incremented. So, if it's the first time I see the IP, its value will be one. The second time I see the IP, the value will increment to two and so on.
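
If you haven't used Perl hashes much, this tiny example shows the tally behavior on its own; the repeated address ends up with a count of two and the other with a count of one:

perl -e 'foreach ( qw(1.2.3.4 5.6.7.8 1.2.3.4) ){ $v{$_}++; } foreach( keys %v ){ print "$v{$_}\t$_\n"; }'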

Once I'm done iterating through each line in the file, I then drop to a foreach loop:

foreach( keys %v ){ 
   print "$v{$_}\t$_\n"; 
}

Basically, all this does is iterate through every key in the hash and output its value (the number of times I matched that IP in the file) and the IP itself. Note that I didn't sort the values here. I very well could have—Perl has powerful methods to sort output—but to make the code simpler and more flexible, I opted to pipe the output to the command-line sort utility. That way, even if you don't know Perl too well but know the command line, you can tweak arguments to sort to reverse the output or even pipe it further to tail, so you see only the top ten IPs.
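
Putting it all together, here's one way to answer the original question about the top ten IPs; because sort -n sorts in ascending order, the busiest IPs end up at the bottom, and tail grabs the last ten lines:

perl -e 'while(<>){ if( m|(^\d+\.\d+\.\d+\.\d+).*?01/Nov/2008| ){ $v{$1}++; } } foreach( keys %v ){ print "$v{$_}\t$_\n"; }' /var/log/apache/access.log | sort -n | tail -n 10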

If I want to know only the overall number of unique visitors, each line of the output represents a unique IP, so I just need to count the overall number of lines. To do this, I simply pipe the output to wc -l.
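
Because the count doesn't care about order, I can skip sort entirely and pipe the same one-liner straight to wc:

perl -e 'while(<>){ if( m|(^\d+\.\d+\.\d+\.\d+).*?01/Nov/2008| ){ $v{$1}++; } } foreach( keys %v ){ print "$v{$_}\t$_\n"; }' /var/log/apache/access.log | wc -l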

And, there you have it, a quick-and-dirty one-liner to chop through your logs and tally results. The beauty of using Perl hashes for this is that you can tweak the regular expression to match all sorts of values in the file—not just IP addresses—and tally all sorts of useful information. I've used modified versions of the script to count how many times a particular file was downloaded by unique IPs, and I've even used it to perform statistics on mailq output.
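
As a rough sketch of the mailq case (assuming Postfix-style mailq output, where the first line of each queue entry ends with the sender address), something like this would tally queued messages per sender; treat the regular expression as a starting point, since your mailq output format may differ:

mailq | perl -e 'while(<>){ if( m|^\w+[*!]?\s+\d+\s+.*\s(\S+\@\S+)$| ){ $v{$1}++; } } foreach( keys %v ){ print "$v{$_}\t$_\n"; }' | sort -n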

Kyle Rankin is a Senior Systems Administrator in the San Francisco Bay Area and the author of a number of books, including Knoppix Hacks and Ubuntu Hacks for O'Reilly Media. He is currently the president of the North Bay Linux Users' Group.
