Work the Shell - Analyzing Log Files Redux
Last month, we spent a lot of time digging around in the Apache log files to see how you can use basic Linux commands to ascertain some useful statistics about your Web site.
You'll recall that even simple commands, such as head, tail and wc, can help you figure out things like hits per hour and, coupled with some judicious use of grep, can show you how many graphics you sent, which HTML files were most popular and so on.
More important, even the most rudimentary use of awk made it easy to cut out a specific column of information and examine the individual fields of a standard Apache log file entry. This month, I dig further into the log files and explore how you can use more sophisticated scripting to ascertain total bytes transferred for a given time period.
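To jog your memory, here's the flavor of those commands; the search pattern and field choice here are just illustrative, so adjust them for your own log:

$ grep '.gif' access_log | wc -l
$ awk '{ print $7 }' access_log | sort | uniq -c | sort -rn | head

The first line gives a rough count of the graphics you served, and the second summarizes the most frequently requested URLs, because field #7 of each log entry holds the requested path.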
Many ISPs have a maximum allocation for your monthly bandwidth, so it's important to be able to figure out how much data you've sent. Let's examine a single log file entry to see where the bytes-sent field is found:
72.82.44.66 - - [11/Jul/2006:22:15:14 -0600] "GET /individual-entry-javascript.js HTTP/1.1" 200 2374 "http://www.askdavetaylor.com/sync_motorola_razr_v3c_with_windows_xp_via_bluetooth.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
There are a lot of different fields of data here, but the one we want is field #10, which in this instance is 2374. Double-check on the filesystem, and you'll find out that this is the size, in bytes, of the file sent, whether it be a graphic, HTML file or, as in this case, a JavaScript include file.
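If you'd like to verify that yourself, compare the logged value against the file's size on disk; the path below is hypothetical, so point it at wherever that file actually lives in your document root:

$ ls -l /path/to/htdocs/individual-entry-javascript.js

The size column of the ls output should match the 2374 recorded in the log entry.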
Therefore, if you can extract all the field #10 values in the log file and summarize them, you can figure out total bytes transferred. Extracting the field is easy; adding it all up is trickier, however:
$ awk '{ print $10 }' access_log
That gets us all the transfer sizes, and awk's capabilities let us do the summing in a single-line command too:
$ awk '{ sum += $10 } END { print sum }' access_log
As I have said before, awk has lots of power for those people willing to spend a little time learning its ins and outs. Notice a lazy shortcut here: I'm not initializing the variable sum, just exploiting the fact that variables, when first allocated in awk, are set to zero. Not all scripting languages offer this shortcut!
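If you prefer to be explicit, or just want the habit for languages that aren't so forgiving, awk's BEGIN block lets you initialize the variable yourself; this version behaves exactly like the one-liner above:

$ awk 'BEGIN { sum = 0 } { sum += $10 } END { print sum }' access_log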
Anyway, run this little one-liner on an access log, and you can see the total number of bytes transferred: 354406825. I can divide that out by 1024 to figure out kilobytes, megabytes and so on, but that's not useful information until we can figure out one more thing: what length of time is this covering?
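If you want to see that total in friendlier units right away, a bit of quick integer math does the trick; the 354406825 here is simply the total from my own log, so substitute yours:

$ expr 354406825 / 1024
346100
$ expr 354406825 / 1024 / 1024
337

In other words, roughly 338MB of data (expr does integer math, so it rounds down), but without knowing the time period it covers, that's still just a number.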
We can calculate the elapsed time by looking at the first and last lines of the log file and computing the difference, or we can simply use grep to pull one day's worth of data out of the log file and then multiply the result by 30 to get a running average monthly transfer rate.
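Here's a quick way to eyeball those first and last timestamps, assuming the standard log format shown earlier, where field #4 holds the date and time:

$ head -1 access_log | awk '{ print $4 }'
$ tail -1 access_log | awk '{ print $4 }'

Subtracting one from the other takes real work, though, which is why the grep approach is so appealing.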
Look back at the log file entry; the date is formatted like so: - [11/Jul/2006:22:15:14 -0600]. Ignore everything other than the fact that the date format is DD/MMM/YYYY.
I'll test it with 08/Aug/2006 to pull out just that one day's worth of log entries and then feed it into the awk script:
$ grep "08/Aug/2006" access_log | awk '{ sum += $10 } ↪END { print sum }' 78233022
Just a very rough estimate: 78MB. Multiply that by 30 and we'll get 2.3GB for that Web site's monthly data transfer rate.
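That multiplication is easy to check right at the command line; the backslash keeps the shell from expanding the asterisk as a wild card:

$ expr 78233022 \* 30
2346990660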
Now, let's turn this into an actual shell script. What I'd like to do is pull out the previous day's data from the log file and automatically multiply it by 30, so any time the command is run, we can get a rough idea of the monthly data transfer rate.
The first step is to do some date math. I am going to make the rash assumption that your date command can do date math at all. If it can't, well, that's beyond the scope of this piece, though I do talk about it in my book Wicked Cool Shell Scripts (www.intuitive.com/wicked).
The BSD version of date (the one you'll also find on Mac OS X) lets you back up arbitrary time units by using the -v option with a modifier. To back up a day, use -v-1d. For example:
$ date
Wed Aug 9 01:00:00 GMT 2006
$ date -v-1d
Tue Aug 8 01:00:47 GMT 2006
The other neat trick the date command can do is to print its output in whatever format you need, using the many, many options detailed in the strftime(3) man page. To get DD/MMM/YYYY, we add a format string:
$ date -v-1d +%d/%b/%Y
08/Aug/2006
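If your system has GNU date instead, as most Linux distributions do, there's no -v option, but the -d (or --date) option produces the same result:

$ date -d '1 day ago' +%d/%b/%Y
08/Aug/2006

If that's your situation, swap this -d incantation into the script below wherever you see -v-1d.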
Now, let's start pulling the script together. The first step in the script is to create this date string so we can use it for the grep call, then go ahead and extract and summarize the bytes transferred that day. Next, we can use those values to calculate other values with the expr command, saving everything in variables so we can have some readable output at the end.
Here's my script, with just a little bit of fancy footwork:
#!/bin/sh

LOGFILE="/home/limbo1/logs/intuitive/access_log"

yesterday="$(date -v-1d +%d/%b/%Y)"

# total number of "hits" and "bytes" yesterday:

hits="$(grep "$yesterday" $LOGFILE | wc -l)"
bytes="$(grep "$yesterday" $LOGFILE | awk '{ sum += $10 } END { print sum }')"

# now let's play with the data just a bit

avgbytes="$(expr $bytes / $hits )"
monthbytes="$(expr $bytes \* 30 )"

# calculated, let's now display the results:

echo "Calculating transfer data for $yesterday"
echo "Sent $bytes bytes of data across $hits hits"
echo "For an average of $avgbytes bytes/hit"
echo "Estimated monthly transfer rate: $monthbytes"

exit 0
Run the script, and here's the kind of data you'll get (once you point the LOGFILE variable to your own log):
$ ./transferred.sh
Calculating transfer data for 08/Aug/2006
Sent 78233022 bytes of data across 15093 hits
For an average of 5183 bytes/hit
Estimated monthly transfer rate: 2346990660
We've run out of space this month, but next month, we'll go back to this script and add some code to have the transfer rates displayed in megabytes or, if that's still too big, gigabytes. After all, an estimated monthly transfer rate of 2346990660 is a value that only a true geek could love!
Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com.