The Many Paths to a Solution
A project I'm involved with has made me think about how there are always many solution paths for any given problem in the Linux universe. For this other project, I wanted to cobble together a version of grep that let me specify proper regular expressions without having to worry about the -E flag and get a context for the matches too.
These are both popular expansions to grep, of course: the former demonstrated by both grep -E and the egrep shortcut, while the latter task is done with grep -C and, on some UNIX and Linux systems, wgrep.
But, there are a lot of different ways to create that particular functionality that don't involve relying on a modern version of grep; older versions might have the -E flag, but don't include support for contextualization.
So in this article, I thought it would be interesting to look at different ways to produce what I shall call wegrep, a version of grep that includes both the -C contextual window and the -E regular expression pattern support.
If you have the modern GNU grep, which you can ascertain by simply trying to use the -C flag, this all becomes easy:
$ grep -C
grep: option requires an argument -- C
There's a pretty gnarly usage statement after this, but if your version can understand the -C flag or its wordy sibling --context, you're in luck.
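If you'd rather test for the flag from inside a script than eyeball the error message, a probe along these lines might do. It's only a sketch: grep_has_context is a hypothetical helper name, and it assumes that a grep without -C support exits with a non-zero status when handed the flag.

```shell
#!/bin/sh
# grep_has_context is a hypothetical helper, not part of grep itself.
# It runs a one-line search with -C and throws away all output, so the
# only thing left is the exit status: zero if grep accepted the flag.
grep_has_context() {
  printf 'probe\n' | grep -C 1 probe >/dev/null 2>&1
}

if grep_has_context ; then
  echo "this grep understands -C"
else
  echo "this grep lacks -C"
fi
```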
Enter a "wrapper", a simple script that changes the default behavior of a program. At its simplest, it actually can be a system alias, so this:
alias ls="/bin/ls -F"
is a sort of wrapper, ensuring that whenever I run the ls command, the -F flag is specified.
For this smarter version of grep, I simply could tell the user what flags to use or set specific flags with GREP_OPTIONS, an environment variable, but let's build out wegrep, as discussed.
For usage, it's going to be as simple as possible: command, pattern, source file. Like this:
wegrep '^Alice' wonderland.txt
This would search the file wonderland.txt for the regex "Alice", rooted to the beginning of a line.
Easily done:
grep=/usr/bin/grep
if [ $# -ne 2 ] ; then
  echo "Usage: wegrep [pattern] filename" ; exit 1
fi
$grep -C2 -n -E "$1" "$2"
I even added some error checking to ensure that the user specified the right number of parameters, with a simple error message to hide some of the complexity of the real grep command.
For a test file, I'm going to use the first four paragraphs of Lewis Carroll's immortal Alice in Wonderland, as downloaded from Project Gutenberg.
Here's the result of my first invocation:
$ sh wegrep '^Alice' wonderland.txt
11-Down the Rabbit-Hole
12-
13:Alice was beginning to get very tired of sitting by her
14-sister on the bank, and of having nothing to do: once
15-or twice she had peeped into the book her sister was
--
--
26-
27-There was nothing so very remarkable in that; nor did
28:Alice think it so very much out of the way to hear the
29-Rabbit say to itself, 'Oh dear! Oh dear! I shall be
30-late!' (when she thought it over afterwards, it
You can see that grep does a good job with this task, showing me two lines of context above and below each match, and denoting which line contains the match itself by separating the line number from the content with a : rather than a -.
But what if your version of grep doesn't have support for the -C flag? What if you actually need to identify which lines match the pattern, then roll your own context display?
Since grep is still available, and all but the most ancient of grep implementations support the -E flag to allow the user to specify a regular expression, the task can be broken into two parts: identify which lines match, then figure out a way to list lines (n-2)..n..(n+2), as shown in the above output.
The first task can be done surprisingly easily because grep has a handy -n flag that prepends the line number to each matching line. With that, getting a list of which lines match the specified pattern is straightforward.
But, let's see what's output first:
$ grep -n -E '^Alice' wonderland.txt
13:Alice was beginning to get very tired of sitting by her
28:Alice think it so very much out of the way to hear the
Now it's a job for Superman! I mean, um, cut:
grep -n -E "$pattern" "$file" | \
cut -d: -f1
13
28
Let's switch to the other task of showing a range of lines centered on the specified line. You could do this with a tortured pairing of head and tail, but sed is a much better tool for the job this time.
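For comparison, here's roughly what the head-and-tail route might look like. This is a sketch under my own assumptions: lines_between is a hypothetical helper name, and it relies on the tail -n +N form, which starts output at line N.

```shell
#!/bin/sh
# lines_between is a hypothetical helper: print lines $2 through $3 of
# the file named in $1 using only tail and head.
lines_between() {
  file=$1 ; first=$2 ; last=$3
  count=$(( last - first + 1 ))
  # tail -n +N starts at line N; head then keeps the next $count lines.
  tail -n +"$first" "$file" | head -n "$count"
}

# Example call (file name as used in this article):
#   lines_between wonderland.txt 12 14
```

It works, but it takes two processes and a line count that's easy to get off by one, all to say what sed can express as a single address range.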
In fact, sed makes it easy. Want to grab lines 12, 13 and 14? This'll do the trick:
sed '12,14p' wonderland.txt
Well, not quite. The problem is that the default behavior of sed is to echo every line it sees in addition to whatever the user specifies, so you'll end up with every line from wonderland.txt and additionally have lines 12–14 appear a second time as the statement is matched and executed (the p suffix means "print").
That's why if you're going to do anything with sed, it's critical to know its -n flag, which suppresses its desire to output every line it reads. Now here's a working command:
$ sed -n '12,14p' wonderland.txt
Alice was beginning to get very tired of sitting by her
sister on the bank, and of having nothing to do: once
Can you see how to chain these together? It all can be done in a simple for loop (particularly if you ignore error checking for now). But again, there's another small step required: the first and last line numbers of the context window around each matching line need to be calculated. That's easy math:
before=$(( $match - $context ))
after=$(( $match + $context ))
Here context specifies whether you want 1, 2, 3 or more lines of context above and below the matching line.
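One piece of the error checking being skipped here is worth a quick look: a match near the top of the file makes before zero or negative, which sed implementations generally reject as a starting address. A possible guard could look like the following sketch, where context_range is a hypothetical helper name of my own. A value of after that runs past the end of the file needs no such care, because sed simply stops when the input does.

```shell
#!/bin/sh
# context_range is a hypothetical helper: given a matching line number
# and a context size, print the first and last line of the window,
# never letting the first line drop below 1.
context_range() {
  match=$1 ; context=$2
  before=$(( match - context ))
  after=$(( match + context ))
  if [ "$before" -lt 1 ] ; then
    before=1
  fi
  echo "$before $after"
}

context_range 13 2    # prints: 11 15
context_range 1 2     # prints: 1 3
```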
Let's give this a whirl:
#!/bin/sh
# wegrep - grep with context and regular expressions
grep=/usr/bin/grep
sed=/usr/bin/sed
if [ $# -ne 2 ] ; then
  echo "Usage: wegrep [pattern] filename" ; exit 1
fi
for match in $($grep -n -E "$1" "$2" | cut -d: -f1)
do
  before=$(( $match - $context ))
  after=$(( $match + $context ))
  $sed -n '${before},${after}p' "$2"
done
exit 0
Except it turns out that there are two critical bugs in the above code, as is immediately apparent when you run your first test:
$ sh wegrep '^Alice' wonderland.txt
wegrep: line 14: 13:Alice - : syntax error in expression
↪(error token is ":Alice - ")
Can you see the first bug? Line 14 is the calculation for the variable before.
So what's wrong? The context variable was never initialized with a value, so the mathematical expression is essentially:
13 -
Which is correctly flagged as an error. Easily fixed.
The second bug is more subtle, but here's the clue when you run the script with context defined as 1 near the top of the script:
$ sh wegrep '^Alice' wonderland.txt
sed: 1: "${before},${after}p": unexpected EOF (pending }'s)
sed: 1: "${before},${after}p": unexpected EOF (pending }'s)
That's definitely odd. It's sed that's complaining, but what's wrong with the line that invokes sed?
Let's have another look at that line:
$sed -n '${before},${after}p' "$2"
Now can you see the error? It's a subtle and common problem in shell scripts: I'm using the wrong quotation marks. Remember, in a shell script, single quotation marks prevent the interpretation of variables. Switch it to double quotation marks, and everything now works great:
$ sh wegrep '^Alice' wonderland.txt
Alice was beginning to get very tired of sitting by her
sister on the bank, and of having nothing to do: once
There was nothing so very remarkable in that; nor did
Alice think it so very much out of the way to hear the
Rabbit say to itself, 'Oh dear! Oh dear! I shall be
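The quoting difference is easy to see in isolation. In this small sketch, the variable names single and double are my own, chosen just for the demonstration, and the line numbers are arbitrary:

```shell
#!/bin/sh
before=12
after=14
# Single quotes: the shell passes the text through untouched, so sed
# would receive the literal characters ${before},${after}p.
single='${before},${after}p'
# Double quotes: the shell substitutes the variable values first, so
# sed would receive a real address range.
double="${before},${after}p"
echo "$single"    # prints: ${before},${after}p
echo "$double"    # prints: 12,14p
```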
Now another problem rears its head: how do you differentiate between blocks that have matched? Easy: add ----- before and after each match by adding a couple of echo statements to the for loop:
for match in $($grep -n -E "$1" "$2" | cut -d: -f1)
do
  before=$(( $match - $context ))
  after=$(( $match + $context ))
  echo "-----"
  $sed -n "${before},${after}p" "$2"
  echo "-----"
done
This works, but it's a bit clunky as output goes, although it pretty closely matches what modern grep does with the -C flag:
$ sh wegrep '^Alice' wonderland.txt
-----
Alice was beginning to get very tired of sitting by her
sister on the bank, and of having nothing to do: once
-----
-----
There was nothing so very remarkable in that; nor did
Alice think it so very much out of the way to hear the
Rabbit say to itself, 'Oh dear! Oh dear! I shall be
-----
As a purist, I'd much rather have one dashed line between output blocks, one before the first match and one after the last, with no doubling of lines.
That's not hard to do, and there's a second task of adding back line numbers and ideally denoting which line has the match to the regular expression. But I'm out of room, so those tasks will have to wait until another day.
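As a small preview of the separator task, one possible approach is to print a dashed line before every block and a single closing line after the loop. Treat this as a sketch rather than the finished script: show_blocks and the found flag are hypothetical names of my own, the context size defaults to 1 if not given, and line numbering is left for later.

```shell
#!/bin/sh
# show_blocks is a hypothetical function: print each context block with
# one dashed line in front of it, plus one closing dashed line at the end.
show_blocks() {
  pattern=$1 ; file=$2 ; context=${3:-1}
  found=0
  for match in $(grep -n -E "$pattern" "$file" | cut -d: -f1)
  do
    before=$(( match - context ))
    after=$(( match + context ))
    # Keep the starting line number valid for matches near the top.
    if [ "$before" -lt 1 ] ; then before=1 ; fi
    echo "-----"
    sed -n "${before},${after}p" "$file"
    found=1
  done
  # Close the last block only if something actually matched.
  if [ "$found" -eq 1 ] ; then echo "-----" ; fi
}

# Example call (file name as used in this article):
#   show_blocks '^Alice' wonderland.txt 1
```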