Introduction to Gawk
How often have you thought to yourself, “I should write a program to do that!” only to realize that you will have to write more than just the code needed to solve the problem at hand? Your program will probably need to get the names of data files from the command line, open and read these files, and allocate and manage memory for data storage. This programming overhead can be a lot of effort to write and debug. To make this programming task even less appealing, what if you need this program “right now” and it may be used only once or twice? Does writing this program still seem worth all the effort? If you are using one of the more traditional languages, such as C or C++, perhaps not. However, the awk programming language may be just the right tool for writing the programs you need while minimizing the programming overhead.
gawk, the GNU version of the powerful awk programming language, lets you concentrate on writing the code to solve the problem at hand without worrying about all the overhead required to actually make your program do its job. gawk offers many features designed to help you quickly write useful and powerful programs. With features such as pattern-matching, associative arrays, automatic handling of command-line argument files, and no need for variable declarations, gawk is able to free you from many of the tiresome details that often get in the way of getting the job done.
gawk is suitable for a wide range of applications, from simple, one-line applications to complex applications that will be used on a regular basis. gawk is also a simpler, easier to use alternative to Perl. Although Perl programs will run faster than comparable gawk programs, the syntax and features of gawk are (in my opinion) easier to read and tend not to become quite so obfuscated.
C programmers will find that parts of gawk are already quite familiar to them. In many ways, the syntax of gawk looks very much like the syntax of C, with constructs such as pre- and post-increment and decrement operators, nestable if-else blocks, for loops which look exactly like those in C, and even the familiar { and } defining sections of code. This close similarity to C is not such a surprise when you consider that one of the originators of the awk programming language, Brian Kernighan, also co-wrote the standard book on C, The C Programming Language, with Dennis Ritchie, the designer of C.
However, beyond this similarity in syntax, awk is a language quite unlike the traditional languages in most common use today.
In this article I will describe the more basic features of working with gawk, the GNU version of awk. There will be many parts of this language that I cannot cover here—for these you will need to consult one of the sources listed in the reference section at the end. Although I will be describing gawk, the features discussed here should be applicable to most versions of the awk programming language. As such, the names gawk and awk are often used interchangeably.
In keeping with the tradition set by countless authors writing about a programming language, here is the ever-popular “Hello World” program written in awk:
BEGIN { print "Hello World" }
Before I explain how to run this program, I will describe how a gawk program, or script, works.
A major difference between gawk and most other languages is that gawk is a pattern-matching language. That is, gawk scans its input looking for patterns which have been specified in the gawk program, and executes the block of gawk code associated with that pattern. A gawk program, or script, consists of one or more patterns which the programmer wishes to match against each line of input, and the corresponding action blocks (enclosed between { and }) which are to be executed when that pattern is found in an input line. So a gawk program has the form:
pattern1 { action1 }
pattern2 { action2 }
. . .
patternN { actionN }
These patterns, which can consist of a simple expression, a regular expression, a combination of patterns, or even an empty pattern, can be as simple or as complex as needed. To print all lines in a file which contain the word “Linux”, the pattern is simply defined as /Linux/ and the action block is {print}. Thus, the complete gawk program can be written as:
/Linux/ { print }
Action blocks consist of one or more gawk statements enclosed between { and }. In this simple example, the print statement will print every line which contains the pattern “Linux”. However, this program will also match such words as “LinuxKernel”, because the pattern does not have to be a discrete word. Also, since pattern matching is case-sensitive by default, it will not match the word “linux”.
If you need to match both upper and lower case, the pattern can be changed to allow for this; it just becomes a more complex pattern. If you wanted the pattern to match both “linux” and “Linux”, you could write the pattern as /[Ll]inux/. In this case, you are telling gawk to look for groups of characters that begin with any of the characters enclosed in the square brackets (here, either an upper or lower case “L”) followed by the lowercase letters “inux”. Other options for dealing with case sensitivity are to use the built-in functions tolower() or toupper() to change the case of the input line (or just parts of the line) before the pattern matching takes place, or to set the built-in variable IGNORECASE (in awk, built-in variables are always written in upper case) to any non-zero value at the start of your program. Note that IGNORECASE is specific to gawk; other versions of awk treat it as an ordinary variable.
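As a rough sketch of the first two approaches, the commands below run a bracket pattern and a tolower() test over a few invented input lines. The sample text and the shell variable names are assumptions made for illustration, and plain awk is used so the same commands should work with gawk as well. The ~ operator, covered later in this article, means “matches”.

```shell
# Sample input is invented for illustration.
input='Linux is fun
I run linux at home
LINUX in capitals
nothing relevant here'

# Bracket expression: matches "Linux" or "linux", but not "LINUX".
bracket=$(printf '%s\n' "$input" | awk '/[Ll]inux/ { print }')

# tolower(): match against a lower-cased copy of the line, so any mix of case is caught.
lowered=$(printf '%s\n' "$input" | awk 'tolower($0) ~ /linux/ { print }')

printf 'bracket:\n%s\n\ntolower:\n%s\n' "$bracket" "$lowered"
```

The bracket pattern finds the first two lines only, while the tolower() version also finds the line written in capitals.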
Patterns in gawk can be as simple or as complex as needed to match the desired item in the input line. If you do not specify a pattern, the action block will be executed for every line of input. This is known as an empty pattern. So if you do not explicitly put a pattern into your program, gawk treats the lack of a pattern as a pattern that will match everything in the input.
Alternatively, if you specify a pattern but no action, gawk will provide a default action, namely {print}, for you. So the simple program above can be rewritten as /Linux/, although it is usually better to define an action explicitly, since this results in more readable code.
gawk also defines several special patterns which do not match any input at all, the most commonly used being BEGIN and END. The action block associated with BEGIN will be executed only once, before gawk starts to read the input files, and allows you to take care of any setup and initialization details that may be needed. The action block for the END pattern will be executed after the processing of all input has been completed and is useful for printing any final results from your program. The BEGIN and END patterns are optional; you include them only when there is a need for them.
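A small sketch of these two defaults, using invented input: a program with only a pattern, a program with only an action, and the fully spelled-out form for comparison. The variable names are hypothetical, and plain awk is used so the commands should behave the same under gawk.

```shell
# Sample input is invented for illustration.
input='Linux one
other
Linux two'

# Pattern and action both given.
explicit=$(printf '%s\n' "$input" | awk '/Linux/ { print }')
# Pattern only: the default action, print, is supplied.
implicit=$(printf '%s\n' "$input" | awk '/Linux/')
# Action only: the empty pattern matches every line.
everything=$(printf '%s\n' "$input" | awk '{ print }')

printf 'pattern only:\n%s\n\naction only:\n%s\n' "$implicit" "$everything"
```

The first two commands print the same two lines, and the third reproduces the input unchanged.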
However, if you wish to write a gawk script that takes no input at all—say for example, the ever-popular “Hello World” program that was shown earlier—your gawk statements must be enclosed in the action block for the BEGIN pattern. Otherwise, gawk will see them as part of the main input loop block (described next) and wait for some input (or a Control-D) before printing—probably not what you want to happen in this case.
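To make the order of execution concrete, here is a minimal sketch with a BEGIN-only program followed by a program that has BEGIN, a main block with an empty pattern, and END. The three input lines and the variable n are invented for the example; plain awk is used, so gawk should give the same result.

```shell
# A BEGIN-only program never enters the input loop, so it finishes at once.
hello=$(awk 'BEGIN { print "Hello World" }')

# BEGIN, a main block with an empty pattern, and END over three invented lines.
report=$(printf 'one\ntwo\nthree\n' | awk '
BEGIN { print "starting" }        # runs once, before any input is read
      { n = n + 1 }               # runs for every input line
END   { print n " lines seen" }   # runs once, after the last line
')

printf '%s\n%s\n' "$hello" "$report"
```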
Work in almost any programming language and you will have to write code to get the names of any files from the command line, open these files, and read their contents. For most file access, gawk lets you skip these steps entirely. If you pass one or more file names on the command line, after executing the code in the BEGIN block (if present), gawk will automatically get each name from the command line, open the file, read its contents line-by-line, try to match any pattern you have defined against these lines, close the file when it is finished, and move on to the next file listed. If the input is coming from standard input (i.e., you are piping the output of another program to your gawk program), the input process is equally transparent. However, if you find that you need to handle this file input in some different manner, gawk provides you with all the tools necessary to do this. But for most of the file handling you will need, it is better to let gawk's input loop do the work for you.
Now that we have seen how a gawk program works, the next step is to see how to make your program run. With gawk on Linux, we have three ways to do this. For those truly quick-and-dirty tasks, an entire gawk program can be written and executed on the command line, although this is really only practical for very small programs. Using our simple example from above, we can run it with the command:
gawk '/Linux/ {print}' file.txt
When running a gawk script from the command line, you must enclose the awk statements in single quotes and list any data files after the closing quote. If you need to use more than one gawk statement in an action block, simply separate each statement using the semicolon. For example, if you wanted to print each line that contained “Linux” and keep a count of how many input lines contain the pattern /Linux/, you could write:
gawk '/Linux/{ print; count=count+1 } END { print count " lines" }' file.txt
You can list any number of data files on the command line and gawk will automatically open and read them, looking for any lines which match the pattern defined.
You can also use your favourite editor to write your gawk program and pass the name of the file to gawk using the -f option to tell gawk to try to execute the contents of that file. (For convenience, I like to use the extension “.awk” on these files, although this is not necessary.) So if the file linux.awk contains the pattern-action block:
/Linux/ {
    print
    count = count + 1
}
END { print count " lines found." }
It can be executed by the command:
gawk -f linux.awk file.txt anotherfile.txt
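The following sketch walks through the -f approach from start to finish in a scratch directory. The data files and their contents are invented, the program is a trimmed version of linux.awk that only counts, and plain awk is used on the assumption that gawk accepts the same command line.

```shell
# Work in a scratch directory so nothing is left behind (file contents are invented).
dir=$(mktemp -d)
printf 'Linux rocks\nsomething else\n' > "$dir/file.txt"
printf 'more Linux\nLinux again\n'     > "$dir/anotherfile.txt"

# Store the program in a file...
cat > "$dir/linux.awk" <<'EOF'
/Linux/ { count = count + 1 }
END     { print count " lines found." }
EOF

# ...and run it over both data files with -f.
found=$(awk -f "$dir/linux.awk" "$dir/file.txt" "$dir/anotherfile.txt")
printf '%s\n' "$found"

rm -r "$dir"
```

Both files are read in turn by the same input loop, so the count in the END block covers all of them.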
Under Linux (and other versions of Unix) there is another, easier way to run your gawk program—simply put the line
#!/usr/bin/gawk -f
at the top of the program to indicate the path to the gawk interpreter. Make the file executable using the chmod command: chmod +x linux.awk. Then we can execute the gawk program by typing its name and any parameters. (Note: you will need to check the actual location of the gawk interpreter on your system and put this path in the first line.)
Another powerful and time saving feature of gawk is its ability to automatically separate each input line into fields, each referred to by number. The entire line is referred to as $0 and each field within the current line is $1, $2, and so forth. So if the input line is This is a line,
$0 = This is a line
$1 = This
$2 = is
$3 = a
$4 = line
Likewise, the built-in variable NF, which contains the number of fields in the current input line, will be set to 4. If you try to refer to fields beyond NF, their value will be the empty (null) string. Another built-in variable, NR, contains the total number of input lines that awk has read so far.
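A quick way to watch these variables change is to print them for each line. The sketch below uses two invented input lines and asks for a fifth field that neither line has; plain awk is used and should match gawk here.

```shell
# Print the line number, the field count, the first field, and a field beyond NF.
fields=$(printf 'This is a line\nshort one\n' | awk '
{ print NR ": NF=" NF " first=" $1 " fifth=[" $5 "]" }
')
printf '%s\n' "$fields"
```

The brackets around the fifth field come out empty on both lines, and NF drops from 4 to 2 on the second line while NR keeps counting up.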
As an example of the use of these fields, if you needed to take the contents of a file and print it out, one word per line (useful if you want to pipe each word in a file to a spell checker), simply run this script:
{ for (i=1;i<=NF;i++) print $i }
To separate the line into fields, gawk uses another built-in variable, FS (for “field separator”). The default value of FS is " " so fields are separated by white space: any number of consecutive spaces or tabs. Setting FS to any other single character means that fields are separated by exactly one occurrence of that character. So if there are two occurrences of that character in a row, gawk will present you with an empty field.
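The difference between the default separator and a single-character separator can be seen with two one-line inputs, both invented for this sketch. Plain awk is used; gawk should behave the same way.

```shell
# Default FS: a run of spaces and tabs counts as one separator.
blank=$(printf 'a   b\tc\n' | awk '{ print NF }')

# FS set to ":": every colon separates, so "::" produces an empty second field.
colon=$(printf 'a::b:c\n' | awk 'BEGIN { FS = ":" } { print NF ": [" $2 "]" }')

printf 'default FS: %s fields\ncolon FS:   %s\n' "$blank" "$colon"
```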
To get a better idea of how FS works with input lines, suppose we wanted to print the full names of all users listed in /etc/passwd, where the fields are separated by :. You would need to set FS=":". If the file names.awk contains the following gawk statements:
{
    FS=":"
    print $5
}
and you run it with gawk -f names.awk /etc/passwd, the program will print field 5 of each line, which in this case is the full name of the user. However, the line FS=":" will be executed for each line in the data file, which is hardly efficient. Worse, a new value of FS only takes effect when the next line is read, so the first line of the file has already been split on white space by the time the assignment runs. If you are setting FS, it is usually best to make use of the BEGIN pattern, which is run only once, and rewrite our program as:
BEGIN { FS=":" } { print $5 }
Now the line FS=":" will be executed only once, before gawk starts to read the file /etc/passwd.
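To try this without depending on the contents of a real /etc/passwd, the sketch below feeds the BEGIN version of the program two invented records in the same colon-separated layout. The user names and the shell variable names are assumptions for illustration, and plain awk stands in for gawk.

```shell
# Invented records in /etc/passwd layout (name:password:uid:gid:full name:home:shell).
passwd_sample='root:x:0:0:Super User:/root:/bin/bash
alice:x:1000:1000:Alice Example:/home/alice:/bin/csh'

# FS is set once in BEGIN, so it is already in effect for the first record.
names=$(printf '%s\n' "$passwd_sample" | awk 'BEGIN { FS = ":" } { print $5 }')
printf '%s\n' "$names"
```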
This automatic splitting of input lines into fields can be used to make patterns more powerful by allowing you to restrict the pattern matching to a single field. Still using /etc/passwd as an example, if you wanted to see the full name of all users on your Linux system (field 5 of /etc/passwd) who prefer to use csh rather than bash as their chosen shell (field 7 of /etc/passwd), you could run the following gawk program:
# (in awk, anything after the # is a comment)
# change the field separator so we can separate
# each line of the file /etc/passwd and access
# the name and shell fields
BEGIN { FS=":" }
$7 ~ /csh/ {print $5}
The gawk operator ~ means “matches”, so we are testing if the contents of the seventh field match csh. If the match is found, then the action block will be executed and the name will be printed. Also, remember that since patterns match substrings, this will also print the names of tcsh users. If a particular input line does not contain a seventh field, no problem: no match will be found for this pattern. Similarly, the pattern $7 !~ /bash/ will run its action block if the contents of the seventh field do not match the pattern bash. (Unlike the match operator, this pattern will match if $7 does not exist in the current input line. Recall that if we try to access a field beyond NF, its value will be empty (NULL), and an empty string does not match /bash/, so the action block for this pattern will be executed.)
To further demonstrate the power of fields and pattern matching, let's go back to the problem of dealing with case sensitivity in pattern matching. By using a built-in function, toupper() or tolower(), we can change the case of all or selected parts of the input line. Suppose we have a data file containing names (the first field) and phone numbers (the second field), but some names are all lower case, some are all upper case and some are mixed. We could simplify the matching by modifying the pattern to:
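Both operators can be tried on a handful of invented records in /etc/passwd layout, including one deliberately short record that has no seventh field. Everything in the sample data is made up for this sketch, and plain awk is used on the assumption that gawk gives the same answers.

```shell
# Invented records; the last one stops after field 5, so it has no shell field.
passwd_sample='root:x:0:0:Super User:/root:/bin/bash
alice:x:1000:1000:Alice Example:/home/alice:/bin/csh
bob:x:1001:1001:Bob Example:/home/bob:/bin/tcsh
short:x:2:2:Short Record'

# Full names of users whose shell field contains "csh" (this includes tcsh).
csh_users=$(printf '%s\n' "$passwd_sample" | awk 'BEGIN { FS = ":" } $7 ~ /csh/ { print $5 }')

# Login names of users whose shell field does not contain "bash".
not_bash=$(printf '%s\n' "$passwd_sample" | awk 'BEGIN { FS = ":" } $7 !~ /bash/ { print $1 }')

printf 'csh users:\n%s\n\nnot bash:\n%s\n' "$csh_users" "$not_bash"
```

The short record is skipped by the first pattern but selected by the second, because its missing seventh field is empty and an empty string does not contain “bash”.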
toupper($1) ~ /LINUX/ {print $0}
This will cause the name in field 1 to be converted to upper case before awk tries to match it against the pattern. No other parts of the input line will be compared against the pattern.
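Here is a short sketch of that pattern applied to an invented phone list. The last record contains “linux” only in its second field, which shows that the match really is restricted to field 1. The names and numbers are made up, and plain awk is used in place of gawk.

```shell
# Invented phone list: name in field 1, number in field 2.
phonebook='linux 555-0100
LINUX 555-0101
Linux 555-0102
minix 555-0103
other linux-555'

# Only field 1 is upper-cased and tested; the rest of the line is ignored.
matches=$(printf '%s\n' "$phonebook" | awk 'toupper($1) ~ /LINUX/ { print $0 }')
printf '%s\n' "$matches"
```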
The control statements in the gawk language closely resemble those found in C, thus making gawk more easily written and understood by C programmers. gawk contains the pre- and post-increment and decrement operators ++ and --, as well as an if-else statement that looks very much like the one found in C. Also multi-line blocks of code are grouped within { and }. Even the for loop seems to have been taken right out of a C programming book.
This allows you to “mix and match” code which takes advantage of gawk's pattern matching with code that uses more traditional control structures, so if patterns are not sufficient for your task (or you are not sure how to use them to accomplish your task) you can use standard programming techniques as well. Conventional programming with gawk is not covered here; the gawk info page (run info gawk) documents this well, and the goal of this article is to demonstrate gawk's distinguishing features.
Another timesaving feature of gawk is that there is no need to declare a variable before using it. A variable can hold a string, an integer, or a floating point number depending on the value assigned to it, and gawk will handle conversions for you automatically. As a result, an expression such as total = 2 + "3" is valid and will give the expected result, 5. To make your job even easier, gawk will initialize each variable when it is used for the first time, so that it behaves as 0 when used as a number and as "" when used as a string. This takes away any worries about uninitialized variables.
gawk also carries this ease of use of variables to arrays. There is no need to declare an array before using it, or even to specify a maximum size for that array. To create an array, simply use it and gawk will allocate the required space for you. As you add more data to the array, its size will automatically expand to accommodate it.
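These conversions are easy to check from a BEGIN block, which needs no input. In the sketch below, unset_num and unset_str are hypothetical names for variables that are never assigned; plain awk is used and gawk should agree.

```shell
vars=$(awk 'BEGIN {
    total = 2 + "3"            # the string "3" is converted to a number
    print total
    print unset_num + 1        # never assigned: behaves as 0 in arithmetic
    print "[" unset_str "]"    # never assigned: behaves as "" in a string
}')
printf '%s\n' "$vars"
```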
However, the array indices in gawk differ from those in languages such as C, in that gawk indices are associative, rather than numeric.
In an associative array, the array index is associated with the value assigned to it. This means that you can write expressions such as theArray["text"]="this is a line". If you wish, you can still use an integer as the index, as in theArray[50] = "some value". It is also possible to use a mixture of strings, integers, and even floating point numbers as indices in the same array, since gawk treats all indices as strings. So the expression theArray[50] = "some value" is equivalent to theArray["50"] = "some value".
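A minimal sketch of this equivalence, using the same array name as the text: the element is stored under the number 50 and read back under the string "50". Plain awk is used on the assumption that gawk behaves identically.

```shell
arr=$(awk 'BEGIN {
    theArray["text"] = "this is a line"
    theArray[50]     = "some value"
    print theArray["50"]       # same element as theArray[50]
    print theArray["text"]
}')
printf '%s\n' "$arr"
```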
To make working with arrays as easy as possible, awk provides the programmer with several powerful array operators. For example, to test whether a particular index is present in an array you can use the in operator (note that in looks at the indices of the array, not at the values stored under them):
if (someValue in theArray) {
    # action to take if someValue is an index of theArray
} else {
    # an alternate action if it is not present
}
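One point worth trying out is that in checks the indices of an array rather than the values stored in it. The sketch below stores one element and then tests for both its index and its value; the element itself is invented for the example, and plain awk stands in for gawk.

```shell
membership=$(awk 'BEGIN {
    theArray["word"] = "a value"
    # "word" is an index of the array, so this test succeeds.
    if ("word" in theArray)
        print "word: present"
    # "a value" is stored in the array but is not an index, so this test fails.
    if ("a value" in theArray)
        print "a value: present"
    else
        print "a value: absent"
}')
printf '%s\n' "$membership"
```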
To perform an action on all values in an array, such as printing each value contained in it, you can use a variation of the for loop, for example:
for (i in theArray)
    print theArray[i]
gawk sets the variable i to the next index in theArray on each pass through the loop, and the print statement uses that index to print the value stored there. The order in which the indices are visited is not specified.
To remove an element from an array, simply use the delete statement. For example, delete theArray["word"] will remove the element whose index is "word" from theArray.
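The loop and delete can be combined in a short sketch: three invented elements are stored, one is deleted, and the rest are listed as index=value pairs. The output is piped through sort only because the order in which the loop visits the indices is not guaranteed. Plain awk is used and gawk should behave the same.

```shell
remaining=$(awk 'BEGIN {
    theArray["a"] = 1
    theArray["b"] = 2
    theArray["word"] = 3
    delete theArray["word"]              # remove the element indexed by "word"
    for (i in theArray)                  # i takes each remaining index in turn
        print i "=" theArray[i]
}' | sort)
printf '%s\n' "$remaining"
```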
With associative arrays, you can quickly build powerful applications without concern for the traditional overhead of declaring the array, allocating the memory, or searching for an item in the array. And size is not a factor—the following gawk program easily read and stored all 45,101 words from the file /usr/dict/words into an associative array (in this case, using the number of the current line as the array index):
{ words[NR] = $1 } END { print NR " words read" }
Such a task would be much more involved in C, as you would need to determine how you want to store all the words (An array declared with a size sufficient for all 45101 character strings? A linked list? A binary tree?). You may argue that with C you are free to choose a data structure which will provide much more efficient memory allocation and faster access speed than is possible with an associative array. While this may be true, it does not tell the whole story—it will certainly take you some time to write and test this C program (and very likely, more time to debug it). The power of the associative arrays and the simple, transparent memory management built into gawk means that you are free from dealing with such concerns—just tell gawk what you want and it handles much of the hard work behind the scenes.
It seems impossible to have such ease of use together with speed; there must be a trade-off. This is one area in which gawk suffers—run-time performance. However, this is not to say that gawk is a terribly slow language. Since gawk is interpreted rather than compiled, it cannot compete with compiled languages for speed of execution. (It also is somewhat slower than a comparable program written in Perl.) However, if your main concern is getting a working program written as quickly as possible, you probably do not want to wrestle with C or C++ for a week to perfect the most efficient algorithm. By trading off the speed advantages and control features of C (or another compiled language) for ease of use, gawk lets you get the job done quickly and relatively painlessly.
If, however, execution speed is a critical point, gawk makes an excellent tool for implementing and testing a prototype before you start to code in C. And when the prototype is complete you may find that the gawk version is fast enough to meet your needs.
gawk offers the programmer a simple, somewhat C-like syntax, automatic file handling, associative arrays, and powerful pattern matching—features which can help you to create a program much more quickly than with a more traditional language. gawk also has many other useful and powerful features such as user-defined functions, recursion, many built-in functions, regular expressions, multidimensional arrays, formatted output using printf and sprintf, even the ability to set variables on the command line. These features are beyond the scope of this article. Without doubt, gawk's interpreter will produce a slower running final product than a C compiler, or even a Perl interpreter. But this slower execution speed (it certainly is not slow!) is more than compensated for by the speed and ease of program development and testing. When you need a program to perform a task and you need it right now, whether it is a quick-and-dirty, use-once program or a program that will be getting plenty of use, gawk may prove to be the right language for the task.
Ian Gordon (iang@hyprotech.com) is a support programmer at Hyprotech Ltd. in Calgary, Alberta. He discovered the joys of Linux 15 months ago, a discovery which has taken up much of his free time.