Work the Shell - Analyzing Log Files
October 1st, 2006 by Dave Taylor
If you're running Apache, and you probably are, you've got a file called access_log on your server, probably in /etc/httpd or some similar directory. Find it (you can use locate or find if needed).
First, let's see how many hits you've received—that is, how many individual files have been served up. Use the wc program to do this:
$ wc -l access_log
83764 access_log
Interesting, but is that for an hour or a month? The way to find out is to look at the first and last lines of the access_log itself, easily done with head and tail:
$ head -1 access_log
140.192.64.26 - - [11/Jul/2006:16:00:59 -0600]
↪"GET /favicon.ico HTTP/1.1" 404 36717 "-" "-"
$ tail -1 access_log
72.82.44.66 - - [11/Jul/2006:22:15:14 -0600]
↪"GET /individual-entry-javascript.js HTTP/1.1"
↪200 2374 "http://www.askdavetaylor.com/
↪sync_motorola_razr_v3c_with_windows_xp_via_bluetooth.html"
↪"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
↪.NET CLR 1.1.4322; .NET CLR 2.0.50727)"
These log file lines can be darn confusing, so don't panic if you look at that and become completely baffled. The good news is that you don't need to know what every field means. In fact, all we care about is the date and time in square brackets and the name of the individual file requested, which follows the “GET” keyword.
Here you can see that the first line in the access log is from 11 July at 16:00:59 and the last line is from 11 July at 22:15:14. Calculate this out, and we're talking a window of about six hours and 15 minutes, or 375 minutes. Divide the total number of hits by this elapsed time (83,764 / 375), and we're seeing 223 hits per minute, or a pretty impressive traffic level of 3.7 hits per second.
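The same arithmetic can be done in the shell. This is a minimal sketch rather than part of the original column: hit_rate is a hypothetical helper name, and awk handles the division because shell arithmetic works only on integers.

```shell
# Hypothetical helper: hits per minute and per second for a given
# hit count and time window (in minutes).
hit_rate() {
    awk -v h="$1" -v m="$2" 'BEGIN {
        per_min = h / m
        printf "%.0f hits/minute, %.1f hits/second\n", per_min, per_min / 60
    }'
}

# Figures from the article: 83764 hits over about 375 minutes
hit_rate 83764 375
# prints: 223 hits/minute, 3.7 hits/second
```

To reuse it on your own log, feed it the line count from wc -l and the number of minutes between the first and last timestamps.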
The second common query is to ascertain which files are requested most often. We can find that out with a quick call to awk to split the requested-file field out of each log file line, followed by a combination of sort and uniq with its ever-useful -c option.
Let's take this one step at a time.
If you go back to the log file lines shown above, you'll find that it's the seventh whitespace-separated field that contains the requested file, meaning we can extract it like this:
$ head access_log | awk '{print $7}'
/favicon.ico
/0-blog-pics/itunes-pc-advanced-importing-prefs.png
/0-blog-pics/itunes-pc-importing-song.png
/styles-site.css
/individual-entry-javascript.js
/motorola_razr_v3c_and_mac_os_x_transfer_pictures_and_wallpaper.html
/Graphics/header-paper2.jpg
/Graphics/pinstripebg.gif
/0-blog-pics/bluetooth-razr-configured.png
/0-blog-pics/itunes-pc-library-sting.png
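If it isn't obvious why the requested file lands in field seven, it can help to have awk number the fields of a single line. The sketch below uses the first log line quoted earlier; number_fields is a hypothetical helper name, not something from the column.

```shell
# Sample line copied from the log excerpt earlier in the article
line='140.192.64.26 - - [11/Jul/2006:16:00:59 -0600] "GET /favicon.ico HTTP/1.1" 404 36717 "-" "-"'

# Print every whitespace-separated field with its awk field number
number_fields() {
    awk '{ for (i = 1; i <= NF; i++) printf "$%d = %s\n", i, $i }'
}

printf '%s\n' "$line" | number_fields
# The output includes:
#   $4 = [11/Jul/2006:16:00:59
#   $6 = "GET
#   $7 = /favicon.ico
```

Because awk splits on whitespace by default, the bracketed timestamp is spread over fields four and five, and the quoted request over fields six through eight.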
When you have a long list of data like this, you can figure out the most popular individual occurrences by sorting everything, then using the uniq command to figure out how often each line occurs. Then use sort again, this time numerically and in reverse, so the output runs from the largest count to the smallest.
Here's an intermediate result to help you see what's happening:
$ awk '{print $7}' access_log | sort | uniq -c | head
535 /
26 //favicon.ico
6 //signup.cgi
1 /0-blog-pics/MVP-Combo_picture.jpg
2 /0-blog-pics/address-book-import.jpg
4 /0-blog-pics/adwords-psp-bids.png
28 /0-blog-pics/aim-congrats-account.png
28 /0-blog-pics/aim-create-screen-name.png
38 /0-blog-pics/aim-delete-screenname-mac.png
29 /0-blog-pics/aim-forget-password.png
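The first sort matters because uniq collapses only adjacent duplicate lines. Here is a small sketch on made-up data that shows the difference; sample_paths and its six paths are hypothetical, chosen only for illustration.

```shell
# Hypothetical sample of requested paths, in arrival order
sample_paths() {
    printf '%s\n' /a.html /b.png /a.html /c.css /b.png /a.html
}

# Without sorting, no two identical lines are adjacent, so uniq -c
# reports six separate lines, each with a count of 1
sample_paths | uniq -c

# After sorting, identical lines sit together and are counted as a
# group: 3 for /a.html, 2 for /b.png, 1 for /c.css
sample_paths | sort | uniq -c
```

The amount of leading whitespace that uniq -c puts before each count varies between implementations, which is one more reason to let sort -rn do the ordering rather than comparing the text yourself.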
All that's left is to sort it by most popular and axe all but the top few matches:
$ awk '{print $7}' access_log | sort | uniq -c | sort -rn | head
6176 /favicon.ico
5807 /styles-site.css
5733 /Graphics/header-paper2.jpg
5655 /Graphics/pinstripebg.gif
5512 /individual-entry-javascript.js
5458 /Graphics/marker-tray.gif
5366 /Graphics/help-button.jpg
5363 /Graphics/digman.gif
5359 /Graphics/delicious.gif
5323 /0-blog-pics/starbucks-hot-coffee.jpg
The first thing you'll notice is that this isn't pages but graphics. That's not a surprise, because just like most Web sites, my own AskDaveTaylor.com has graphics shared across all pages, making the graphics more frequently requested than any given HTML page.
Fortunately, we can restrict the results to HTML pages by simply using the grep program to filter the output of the pipeline:
$ awk '{print $7}' access_log | sort | uniq -c | sort -rn
↪| grep "\.html" | head
446 /motorola_razr_v3c_and_mac_os_x_transfer_pictures_and_wallpaper.html
355 /how_to_create_new_screen_names_on_aol_america_online.html
346 /how_do_i_cancel_my_america_online_aol_account.html
293 /pc_to_sony_psp_how_do_i_download_music.html
206 /how_do_i_get_photos_and_music_onto_my_sony_psp.html
198 /how_do_i_get_my_wireless_wep_password_for_my_sony_psp.html
195 /cant_get_standalone_music_player_to_work_on_myspace.html
172 /convert_wma_from_windows_media_player_into_mp3_files.html
166 /sync_motorola_razr_v3c_with_windows_xp_via_bluetooth.html
123 /how_do_i_create_a_new_screen_name_in_aol_america_online_90.html
(Yes, yes, I know that the URLs on this site are ridiculously long!)
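One variation, offered as a sketch rather than as the column's method: the grep above matches ".html" anywhere on the line, so you could instead test the path field itself inside awk and keep only paths that end in .html. The top_html_pages helper and the four-line sample log are made up for illustration.

```shell
# Hypothetical helper: count requests for paths ending in .html.
# Reads access-log lines on standard input.
top_html_pages() {
    awk '$7 ~ /\.html$/ { print $7 }' | sort | uniq -c | sort -rn | head
}

# Tiny made-up log in the same format as the article's excerpt
sample_log() {
    cat <<'EOF'
10.0.0.1 - - [11/Jul/2006:16:00:59 -0600] "GET /a.html HTTP/1.1" 200 100 "-" "-"
10.0.0.2 - - [11/Jul/2006:16:01:00 -0600] "GET /styles-site.css HTTP/1.1" 200 100 "-" "-"
10.0.0.3 - - [11/Jul/2006:16:01:05 -0600] "GET /a.html HTTP/1.1" 200 100 "-" "-"
10.0.0.4 - - [11/Jul/2006:16:01:09 -0600] "GET /b.html HTTP/1.1" 200 100 "-" "-"
EOF
}

# Lists /a.html with a count of 2, then /b.html with a count of 1
sample_log | top_html_pages
```

On a real log, the equivalent call would be top_html_pages < access_log. Filtering before the first sort also means sort and uniq have fewer lines to process.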
Now, finally, I can see that the articles about the Motorola RAZR phone, AOL screen names and Sony PSP are the most popular articles on the site. Remember, this is a slice of only about six hours too, so the RAZR article is actually being requested an impressive once a minute or so. Popular indeed!
I'm going to stop here now that you've had a taste of how basic Linux commands can be combined to extract useful and interesting data from an Apache log file. Next month, we'll look at one more statistic: how much aggregate data we've transferred. Then, we'll start looking at how to build a shell script that does these sorts of calculations with ease.