Work the Shell - Analyzing Your Search Keywords

February 1st, 2007 by Dave Taylor

Screen the unwanted results out of your access log searches.

Last month, we started exploring how you can use a shell script to extract and analyze the HTTP_REFERER values out of your Web server log and identify the most common terms and phrases that people used to find your pages. Sounds useful, doesn't it?

The problem is, the script is more nuanced than it initially seems. Last month, we wrapped up with the following shell script:

#!/bin/sh

ACCESSLOG="/var/logs/httpd.logs/access_log"

grep 'google.com/search' $ACCESSLOG | \
  awk '{print $11}' | \
  cut -d\? -f2 | cut -d\& -f1 | \
  sed 's/+/ /g;s/%22/"/g;s/q=//' | \
  sort | \
  uniq -c | \
  sort -rn | \
  head -5

When I run this, here's what I see:

$ sh google-searches.sh
  94 hl=en
  18 client=safari
   6 client=firefox-a
   4 sourceid=navclient
   4 client=opera

That's weird. These aren't search terms; they're other variables that are included with search strings sent from sites like Google (hl=en sets the language of Google's interface to English, client=safari identifies the user's Web browser as Apple's Safari and so on). The script keeps only the first variable after the question mark, and that isn't always the search itself.
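To see where those values come from, it can help to run the two cut stages by hand on a single referrer. The URL below is a made-up sample, not a line from my log, and the variable names are only for illustration:

```shell
# Hypothetical referrer in which q= is not the first parameter.
referrer='http://www.google.com/search?hl=en&q=wicked+cool+scripts'

# The same two cut stages as the script: keep what follows the "?",
# then keep only the first "&"-separated parameter.
first_param=$(printf '%s\n' "$referrer" | cut -d\? -f2 | cut -d\& -f1)

echo "$first_param"    # prints: hl=en
```

The search phrase is still in the URL; the second cut simply never reaches it.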

Screening Out False Matches

The problem is revealed when we look at the first ten matches rather than only the first five:

$ sh google-searches.sh | head -10
  94 hl=en
  18 client=safari
   6 client=firefox-a
   4 sourceid=navclient
   4 client=opera
   3 wicked cool scripts
   3 hl=zh-CN
   2 num=100
   2 hs=wNy
   2 barbara nelson%2Bpurses

Ah, so we can see that there are two valid searches here, one for “wicked cool scripts” and one for “barbara nelson%2Bpurses”. Not sure what the latter one's about, but it's useful and important to see. Fortunately, screening out the bogus matches is as simple as using grep to remove lines that include an equal sign: grep -v '='.

Rather than have that at the very end of the long pipe in the script, however, I'll place it immediately after the sed invocation to strip out false results as soon as possible in the pipeline to speed up the entire script. Now it looks like this:

grep 'google.com/search' $ACCESSLOG | \
  awk '{print $11}' | \
  cut -d\? -f2 | cut -d\& -f1 | \
  sed 's/+/ /g;s/%22/"/g;s/q=//' | \
  grep -v '=' | \
  sort | \
  uniq -c | \
  sort -rn

Notice that the sed statement itself strips out the name= part of the search (q=), so that it's not incorrectly matched in the new grep statement.
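One side effect worth knowing about: grep -v '=' throws away every referrer whose first variable isn't q=, search phrase and all. A possible alternative, sketched here rather than taken from the script above, is to split the query string on & and keep only the q= variable wherever it appears. The function name extract_query and the sample URLs are made up for illustration, and the sketch assumes each referrer carries at most one q= variable:

```shell
# extract_query is a hypothetical helper: it reads referrer URLs on stdin
# and prints the value of the q= variable, wherever it sits in the URL.
extract_query() {
  cut -d\? -f2 |          # keep everything after the "?"
  tr '&' '\n' |           # put each variable on its own line
  grep '^q=' |            # keep only the search phrase
  sed 's/^q=//;s/+/ /g'   # drop the name and turn + into spaces
}

# Invented sample referrers, one with q= last and one with q= first.
printf '%s\n' \
  'http://www.google.com/search?hl=en&q=wicked+cool+scripts' \
  'http://www.google.com/search?q=psp+help&client=safari' |
  extract_query
```

Run against the two sample lines, this prints wicked cool scripts and psp help, so the first search is no longer lost.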

Now we have the results we want:

$ sh google-searches.sh | head -10
   3 wicked cool scripts
   2 barbara nelson%2Bpurses
   1 wsj%20password
   1 why did animal kingdom introduce expedition everest
   1 what makes a great speaker%3F
   1 university of phoenix center of writing excellence
   1 ubuntu x problem
   1 triboot osx ubuntu ydl
   1 the best dvd players
   1 symbol html heart

This site doesn't get a huge amount of traffic, so let's run the very same script against my far-busier AskDaveTaylor.com site. The results are more interesting:

$ sh google-searches.sh | head -10
   5 standalone player
   4 psp help
   4 create a myspace
   4 Documents and Settings"
   4 %24NtUninstall
   3 view myspace accounts that are set to private
   3 i cant hear music on runescape
   2 transfer files to psp
   2 sync v3 motorola mac
   2 running unix in windows xp

Much more interesting. Oh, and if you want to know how many different searches you're exploring, it's another simple tweak to the script, an invocation of wc:

$ sh google-searches.sh | wc -l
     501

So out of 501 different searches, the single-most common is “standalone player”, and it shows up only five times. No one phrase comes close to dominating my search traffic.
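Keep in mind that the script ends with uniq -c, so each line of output is one distinct phrase, and wc -l counts phrases rather than individual searches. If you want the total number of searches instead, one way is to add up the count column with awk. The helper name total_searches and the sample lines are invented for this sketch:

```shell
# total_searches is a hypothetical helper: it sums the count column
# written by "uniq -c" (the first field of every line).
total_searches() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Invented sample in the same format as the script's output.
printf '%s\n' '   5 standalone player' '   4 psp help' '   1 ubuntu x problem' |
  total_searches    # prints: 10
```

Dividing a phrase's count by that total gives its real share of the search traffic.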

Stripping Out Unwanted Characters

One more step before we walk away from this script for the month: let's get rid of the strange characters that have been carried over from the original URL encoding of the user's Web browser. What am I talking about? The %24, the closing double quote in Documents and Settings and the %2B in the earlier search for purses.

You can figure out all the mappings and convert everything as appropriate, but I'm lazy at the end of the day and will instead simply find all %xx sequences and replace them with a single space.
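For the record, the non-lazy route might look something like this for the handful of codes that appear in this column. It's only a sketch: decode_common is a made-up name, the list of codes is far from complete, and the rest of the column sticks with the blanket replacement.

```shell
# decode_common is a hypothetical helper that maps a few URL escapes
# back to the characters they stand for.
decode_common() {
  # Order matters: turn literal + into spaces before %2B becomes a +.
  sed 's/+/ /g;s/%20/ /g;s/%22/"/g;s/%24/$/g;s/%2B/+/g;s/%3F/?/g'
}

printf '%s\n' 'barbara+nelson%2Bpurses' 'what+makes+a+great+speaker%3F' |
  decode_common
```

This prints barbara nelson+purses and what makes a great speaker?, which is more faithful to what the user typed but needs a new rule for every code you run into.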

This sounds hard, but it's a perfect job for sed because it allows you to do pattern matching and then replace the matched material with whatever else you desire. Here's how I'd do that:

sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g'

Let's look at this closely before you panic. Characters enclosed in square brackets form a set in regular expression terminology, so [0-9] will match any one of 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9. It turns out that URL encoding uses hexadecimal, so not only can the values be 0-9, but they can also be A, B, C, D, E and F, in upper- or lowercase letters, hence 0-9 and a-f and A-F in the pattern. The overall pattern form is % followed by any one of these possible values, followed by any one of these possible values again. Now you can see the full pattern.

Finally, before we beat this completely into the ground, note that the bigger structure here in the sed statement is s/old/new/g, which replaces old with new throughout the line, whether it occurs once or 15 times.
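A quick way to convince yourself the pattern does what you expect is to feed it a few lines by hand. The wrapper name strip_escapes is made up; the first two sample lines come from the output earlier in this column, and the third is an invented string that should pass through untouched because zz isn't hexadecimal:

```shell
# strip_escapes is a hypothetical wrapper around the sed command above.
strip_escapes() {
  sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g'
}

printf '%s\n' 'barbara nelson%2Bpurses' 'wsj%20password' '100%zz' |
  strip_escapes
```

The first two lines come out as barbara nelson purses and wsj password, and 100%zz is left alone.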

We're not quite done yet, however, because we also need to strip the stray double quotes. Again, this is easily added to the sed statement:

sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g;s/"//g'

Here's the final script:

grep 'google.com/search' $ACCESSLOG | \
  awk '{print $11}' | \
  cut -d\? -f2 | cut -d\& -f1 | \
  sed 's/+/ /g;s/%22/"/g;s/q=//' | \
  sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g;s/"//g' | \
  grep -v '=' | sort | uniq -c | sort -rn

And, the final results:

$ sh google-searches.sh | head -15
   6  NtUninstall
   5 standalone player
   4 psp music
   4 psp help
   4 create a myspace
   4 Documents and Settings
   3 view myspace accounts that are set to private
   3 i cant hear music on runescape
   2 transfer files to psp
   2 sync v3 motorola mac
   2 running unix in windows xp
   2 rss feed reader shell
   2 reinstall windows xp hp
   2 psp transfer music
   2 psp internet

Note that once we remove the stray material, the rankings shift slightly. For example, psp music is now one of the top searches, but earlier its variations were counted separately, so it didn't make it to a top search value.
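If you'd like to try the final pipeline without pointing it at a live log, here's a small test harness. Everything in it other than the pipeline itself is an assumption for the sake of the example: the function name top_searches, the temporary file and the four sample log lines, which I wrote by hand in Apache's combined log format so that the referrer lands in field 11.

```shell
# top_searches is a hypothetical wrapper around the final pipeline;
# it takes the log file as its first argument.
top_searches() {
  grep 'google.com/search' "$1" | \
    awk '{print $11}' | \
    cut -d\? -f2 | cut -d\& -f1 | \
    sed 's/+/ /g;s/%22/"/g;s/q=//' | \
    sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g;s/"//g' | \
    grep -v '=' | sort | uniq -c | sort -rn
}

# Invented sample log; none of these lines come from a real server.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [01/Feb/2007:10:00:00 -0700] "GET /a.html HTTP/1.1" 200 512 "http://www.google.com/search?q=psp+help&hl=en" "Mozilla/5.0"
10.0.0.2 - - [01/Feb/2007:10:01:00 -0700] "GET /a.html HTTP/1.1" 200 512 "http://www.google.com/search?q=psp+help&client=safari" "Mozilla/5.0"
10.0.0.3 - - [01/Feb/2007:10:02:00 -0700] "GET /b.html HTTP/1.1" 200 734 "http://www.google.com/search?hl=en&q=ubuntu+x+problem" "Mozilla/5.0"
10.0.0.4 - - [01/Feb/2007:10:03:00 -0700] "GET /c.html HTTP/1.1" 200 901 "http://www.google.com/search?q=%22Documents+and+Settings%22" "Mozilla/5.0"
EOF

result=$(top_searches "$log")
rm -f "$log"
echo "$result"
```

You should see psp help with a count of 2 and Documents and Settings with a count of 1. The third sample line starts with hl=en, so grep -v '=' drops it and its search phrase never shows up.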

Okay, enough torturing of the Apache log file. Let's wrap this up and we'll switch to something completely different in the next column! Suggestions? Please e-mail them to me!

Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com.
