The Many Paths to a Solution

on September 21, 2016

A project I'm involved with has made me think about how there are always many solution paths for any given problem in the Linux universe. For this other project, I wanted to cobble together a version of grep that let me specify proper regular expressions without having to worry about the -E flag and get a context for the matches too.

These are both popular expansions to grep, of course: the former demonstrated by both grep -E and the egrep shortcut, while the latter task is done with grep -C and, on some UNIX and Linux systems, wgrep.

But, there are a lot of different ways to create that particular functionality that don't involve relying on a modern version of grep; older versions might have the -E flag, but don't include support for contextualization.

So in this article, I thought it would be interesting to look at different ways to produce what I shall call wegrep, a version of grep that includes both the -C contextual window and the -E regular expression pattern support.

Wrapper, Maybe We Just Need a Wrapper

If you have the modern GNU grep, which you can ascertain by simply trying to use the -C flag, this all becomes easy:


$ grep -C
grep: option requires an argument -- C

There's a pretty gnarly usage statement after this, but if your version can understand the -C flag or its wordy sibling --context, you're in luck.
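If you'd rather have a script make that check than eyeball the usage statement, a small probe along these lines should work. This is only a sketch: the function name has_context_flag is made up for illustration, and it assumes that a grep without -C support exits with a non-zero status when handed the flag.

```shell
# has_context_flag [grep-path]
# Hypothetical helper: succeeds if the given grep (default: grep) accepts -C.
# Assumes a grep lacking -C exits non-zero when given the flag.
has_context_flag() {
  # Feed one line that matches, ask for one line of context, discard output.
  printf 'x\n' | "${1:-grep}" -C 1 x >/dev/null 2>&1
}
```

You might use it as has_context_flag && echo "grep supports -C", or pass a full path to test a specific grep binary.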

Enter a "wrapper", a simple script that changes the default behavior of a program. At its simplest, it actually can be a system alias, so this:


alias ls="/bin/ls -F"

is a sort of wrapper, ensuring that whenever I run the ls command, the -F flag is specified.

For this smarter version of grep, I simply could tell the user what flags to use or set specific flags with GREP_OPTIONS, an environment variable, but let's build out wegrep, as discussed.

For usage, it's going to be as simple as possible: command, pattern, source file. Like this:


wegrep '^Alice' wonderland.txt

This would search the file wonderland.txt for the regex "Alice", anchored to the beginning of a line.

Easily done:


grep=/usr/bin/grep
if [ $# -ne 2 ] ; then
  echo "Usage: wegrep [pattern] filename" ; exit 1
fi
$grep -C2 -n -E "$1" "$2"

I even added some error checking to ensure that the user specified the right number of parameters, with a simple error message to hide some of the complexity of the real grep command.
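The same wrapper also fits in a shell function, which is handy if you'd rather keep it in a startup file next to your aliases. The sketch below is a variation, not the script above: WEGREP_CONTEXT is an invented variable name that lets you override the two-line default, and the function calls whatever grep is first in your PATH.

```shell
# wegrep as a shell function (a sketch; WEGREP_CONTEXT is a made-up knob).
wegrep() {
  if [ $# -ne 2 ] ; then
    echo "Usage: wegrep pattern filename" >&2
    return 1
  fi
  # -C: lines of context, -n: line numbers, -E: extended regular expressions
  grep -C "${WEGREP_CONTEXT:-2}" -n -E "$1" "$2"
}
```

Setting WEGREP_CONTEXT=1 before the call narrows the window to one line on either side of each match.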

For a test file, I'm going to use the first four paragraphs of Lewis Carroll's immortal Alice in Wonderland, as downloaded from Project Gutenberg.

Here's the result of my first invocation:


$ sh wegrep '^Alice' wonderland.txt
11-Down the Rabbit-Hole
12-
13:Alice was beginning to get very tired of sitting by her
14-sister on the bank, and of having nothing to do: once
15-or twice she had peeped into the book her sister was
--
26-
27-There was nothing so very remarkable in that; nor did
28:Alice think it so very much out of the way to hear the
29-Rabbit say to itself, 'Oh dear! Oh dear! I shall be
30-late!' (when she thought it over afterwards, it

You can see that grep does a good job with this task, showing me two lines of context above and below each match, and denoting which line contains the match itself by having the : separate the line number from the content.

But what if your version of grep doesn't have support for the -C flag? What if you actually need to identify which lines match the pattern, then roll your own context display?

Building Your Own Context

Since grep is still available, and all but the most ancient of grep implementations support the -E flag to allow the user to specify a regular expression, the task can be broken into two parts: identify which lines match, then figure out a way to list lines (n-2)..n..(n+2), as shown in the above output.

The first task can be done surprisingly easily because grep has a handy -n flag that prefixes each matching line with its line number. With that, getting a list of which lines match the specified pattern is straightforward.

But, let's see what's output first:


$ grep -n -E '^Alice' wonderland.txt
13:Alice was beginning to get very tired of sitting by her
28:Alice think it so very much out of the way to hear the

Now it's a job for Superman! I mean, um, cut:


$ grep -n -E "$pattern" "$file" | \
  cut -d: -f1
13
28
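Wrapped in a function, that pipeline becomes a reusable building block. Here's a sketch, with match_lines as a name picked purely for illustration:

```shell
# match_lines pattern file
# Print the line number of every line in file that matches pattern.
match_lines() {
  # grep -n prefixes each match with "NUMBER:"; cut keeps only the number.
  grep -n -E "$1" "$2" | cut -d: -f1
}
```

Because the line number is always the first colon-separated field, any colons in the matched text itself are harmless.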

Let's switch to the other task of showing a range of lines centered on the specified line. You could do this with a tortured pairing of head and tail, but sed is a much better tool for the job this time.

In fact, sed makes it easy. Want to grab lines 12, 13 and 14? This'll do the trick:


sed '12,14p' wonderland.txt

Well, not quite. The problem is that the default behavior of sed is to echo every line it sees in addition to whatever the user specifies, so you'll end up with every line from wonderland.txt and additionally have lines 12–14 appear a second time as the statement is matched and executed (the p suffix means "print").

That's why if you're going to do anything with sed, it's critical to know its -n flag, which suppresses its desire to output every line it reads. Now here's a working command:


$ sed -n '12,14p' wonderland.txt

Alice was beginning to get very tired of sitting by her
sister on the bank, and of having nothing to do: once
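For comparison, here is the sed approach next to the head and tail pairing, both as small functions. This is a sketch: the function names are illustrative, and the head and tail version assumes the file has at least as many lines as the requested end of the range.

```shell
# Two ways to print lines START through END of FILE (function names are
# illustrative only).

# With sed: -n silences the default output, p prints the addressed range.
range_sed() {
  sed -n "${1},${2}p" "$3"
}

# With head and tail: keep the first END lines, then the last END-START+1.
# Assumes the file has at least END lines.
range_headtail() {
  head -n "$2" "$3" | tail -n "$(( $2 - $1 + 1 ))"
}
```

Both take the same arguments, as in range_sed 12 14 wonderland.txt, but the sed version needs no arithmetic and only one process.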

Can you see how to chain these together? It all can be done in a simple for loop (particularly if you ignore error checking for now). But again, there's another small step required: the first and last line numbers to display, a few lines before and a few lines after the matching line, need to be calculated. That's easy math:


before=$(( $match - $context ))
after=$(( $match + $context ))

Here context specifies whether you want 1, 2, 3 or more lines of context above and below the matching line.
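One wrinkle the arithmetic glosses over: a match on the first line or two of a file produces a starting line of zero or less, which GNU sed, at least, rejects as the start of a range. Here's a sketch of the same math with the start clamped to 1. The context_range function is a hypothetical helper, not part of the script that follows.

```shell
# context_range match context
# Print "before after": the first and last line numbers to display.
context_range() {
  before=$(( $1 - $2 ))
  after=$(( $1 + $2 ))
  # Line numbers start at 1, so never ask for anything earlier.
  if [ "$before" -lt 1 ] ; then
    before=1
  fi
  echo "$before $after"
}
```

The end of the range needs no such guard, because sed simply stops when it runs out of lines.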

Let's give this a whirl:


#!/bin/sh
# wegrep - grep with context and regular expressions
grep=/usr/bin/grep
sed=/usr/bin/sed
if [ $# -ne 2 ] ; then
  echo "Usage: wegrep [pattern] filename" ; exit 1
fi
for match in $($grep -n -E "$1" "$2" | cut -d: -f1)
do
  before=$(( $match - $context ))
  after=$(( $match + $context ))
  $sed -n '${before},${after}p' "$2"
done
exit 0

Except it turns out that there are two critical bugs in the above code, as is immediately apparent when you run your first test:


$ sh wegrep '^Alice' wonderland.txt

wegrep: line 14: 13:Alice -  : syntax error in expression
 ↪(error token is ":Alice -  ")

Can you see the first bug? Line 14 is the calculation for the variable before.

So what's wrong? The context variable is never given a value, so it expands to nothing and the mathematical expression is essentially:


13 -

Which is correctly flagged as an error. Easily fixed.
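One possible fix, sketched here as an assumption rather than the script's final form, is to set context near the top with a default value, so it is never empty when the arithmetic runs:

```shell
# Use the caller's value of context if one is set, otherwise fall back to 1.
context=${context:-1}
match=13    # stand-in for a line number reported by grep
before=$(( match - context ))
after=$(( match + context ))
echo "$before,$after"
```

With context otherwise unset, this prints 12,14, which is exactly the range sed needs for one line of context around line 13.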

The second bug is more subtle, but here's the clue you get when you run the script with context defined as 1 near the top:


$ sh wegrep '^Alice' wonderland.txt
sed: 1: "${before},${after}p": unexpected EOF (pending }'s)
sed: 1: "${before},${after}p": unexpected EOF (pending }'s)

That's definitely odd. It's sed that's complaining, but what's wrong with the line that invokes sed?

Let's have another look at that line:


$sed -n '${before},${after}p' "$2"

Now can you see the error? It's a subtle and common problem in shell scripts: I'm using the wrong quotation marks. Remember, in a shell script, single quotation marks prevent the interpretation of variables. Switch it to double quotation marks, and everything now works great:
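If the difference is hard to picture, this little experiment makes it visible. The values 12 and 14 are just sample numbers:

```shell
before=12
after=14
# Single quotes: the shell passes the text through untouched.
single='${before},${after}p'
# Double quotes: the shell substitutes the variables' values first.
double="${before},${after}p"
echo "$single"
echo "$double"
```

Run as is, this prints the literal text ${before},${after}p on the first line and 12,14p on the second. Only the second is something sed can make sense of.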


$ sh wegrep '^Alice' wonderland.txt

Alice was beginning to get very tired of sitting by her
sister on the bank, and of having nothing to do: once
There was nothing so very remarkable in that; nor did
Alice think it so very much out of the way to hear the
Rabbit say to itself, 'Oh dear! Oh dear! I shall be

Now another problem rears its head: how do you differentiate between blocks that have matched? Easy, print a line of dashes before and after each match by adding a few echo statements to the for loop:


for match in $($grep -n -E "$1" "$2" | cut -d: -f1)
do
  before=$(( $match - $context ))
   after=$(( $match + $context ))
  echo "-----"
  $sed -n "${before},${after}p" "$2"
  echo "-----"
done

This works, but it's a bit clunky as output goes, although it pretty closely matches what modern grep does with the -C flag:


$ sh wegrep '^Alice' wonderland.txt
-----

Alice was beginning to get very tired of sitting by her
sister on the bank, and of having nothing to do: once
-----
-----
There was nothing so very remarkable in that; nor did
Alice think it so very much out of the way to hear the
Rabbit say to itself, 'Oh dear! Oh dear! I shall be
-----

As a purist, I'd much rather have one dashed line between output blocks, one before the first match and one after the last, with no doubling of lines.
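Here is one way the separators could be tidied up, offered as a rough sketch and not a finished solution. The show_blocks function and its found flag are names invented for this example. It prints a dashed line before every block and one more after the last, and it takes the context size as a third argument.

```shell
# show_blocks pattern file context
# Print each match with its context, with one dashed line before the first
# block, one between blocks, and one after the last.
show_blocks() {
  found=0
  for match in $(grep -n -E "$1" "$2" | cut -d: -f1)
  do
    before=$(( match - $3 ))
    after=$(( match + $3 ))
    [ "$before" -lt 1 ] && before=1
    echo "-----"            # opens this block and closes the previous one
    sed -n "${before},${after}p" "$2"
    found=1
  done
  # Close the final block, but print nothing at all if there were no matches.
  [ "$found" -eq 1 ] && echo "-----"
  return 0
}
```

Note that this sketch makes no attempt to merge blocks whose context windows overlap, so two matches on neighboring lines will still repeat some lines.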

That's not hard to do, and there's a second task of adding back line numbers and ideally denoting which line has the match to the regular expression. But I'm out of room, so those tasks will have to wait until another day.

Dave Taylor has been hacking shell scripts on UNIX and Linux systems for a really long time. He's the author of Learning Unix for Mac OS X and Wicked Cool Shell Scripts. You can find him on Twitter as @DaveTaylor, and you can reach him through his tech Q&A site: Ask Dave Taylor.