The awk Utility
Partly tool and partly programming language, awk has a reputation for being overly complex and difficult to use. This column demonstrates its usefulness without getting hung up on the complexity.
Scripting languages such as the UNIX shell and specialty tools like awk and sed have been a standard part of the UNIX landscape since it became commercially available. In 1982, “real programmers” used C for everything. Tools such as sed and awk were viewed as slow, large programs that “hogged” the CPU. Even applications that performed structured data processing and report-generation tasks were implemented in fast, compiled languages like C.
Part of my motivation for writing this article comes from observing that, even today, most system administrators and developers are either uninformed about or intimidated by utilities like awk and sed. As a result, tasks that should be automated continue to be performed manually (or not at all), or duller tools are used instead.
Admittedly, both awk and sed are rather peculiar tools/languages. Both recognize traditional UNIX “regular expressions”, which are powerful but not trivial to learn. Both tools seem to offer too many features, quite often providing several ways of performing the same task. Therefore, mastering all the features of awk and sed and confidently applying them can take a while, or so it may seem. First impressions notwithstanding, you can quickly and effectively apply these tools once you understand their general usefulness and become familiar with a subset of their most useful features. My intent is to provide you with enough information and example code to get you jump-started with awk. You can read about sed in April's “Take Command: Good Ol' sed” by Hans de Vreught.
sed and awk are two of the most productive tools I have ever used. I rely on them quite heavily to implement a wide range of tasks, the implementation of which would take considerably longer using other tools/languages.
I will assume you have heard of or worked with some of the more significant sub-systems of Linux and that you have an understanding of how to use the basic features of the shell command line, such as file I/O and piping. Familiarity with a standard editor such as vi and a working knowledge of regular expressions would also be useful. Many Linux commands, including grep, awk and sed, accept regular expressions as part of their invocation, so you should at least learn the basics.
A Word about Regular Expressions
My coverage of the awk tool is limited to an introductory foundation. Many advanced features are offered by awk (gawk and nawk) but will not be covered here.
The meaning behind the name of this tool is not terribly interesting, but I'll include an explanation to solve the mystery of its rather uncommon name. awk was named after its original developers: Aho, Weinberger and Kernighan. awk scripts are readily portable across all flavors of UNIX/Linux.
awk is typically engaged to reprocess structured textual data. It can easily be used as part of a command-line filter sequence, since by default, it expects its input from the standard input stream (stdin) and writes its output to the standard output stream (stdout). In some of the most effective applications, awk is used in concert with sed—complementing each other's strengths.
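As a minimal sketch of awk working as one stage in a filter sequence, the pipeline below lets sed edit the stream and then lets awk select a field. The three input lines are made up for illustration.

```shell
# Hypothetical input: three "name count" lines produced by printf.
# sed rewrites one word in the stream, then awk reads the result on
# stdin and writes the first field of each line to stdout.
out=$(printf 'apples 3\npears 5\nplums 7\n' |
  sed 's/pears/peaches/' |
  awk '{ print $1 }')
printf '%s\n' "$out"
```

Neither tool needs a temporary file here; each reads what the previous stage wrote.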
The following shell command scans the contents of a file called oldfile, changing all occurrences of the word “UNIX” to “Linux” and writing the resulting text to a file called newfile.
$ awk '{gsub(/UNIX/, "Linux"); print}' oldfile > newfile
Obviously, awk does not change the contents of the original file. That is, it behaves as a stream editor should—passively writing new content to an output stream. This example barely demonstrates anything useful, but it does show that simple tasks can be implemented simply. Although awk is commonly invoked from a parent shell script covering a grander scope, it can be (and often is) used directly from the command line to perform a single straightforward task as just shown.
Although awk has been employed to perform a variety of tasks, it is most suitable for parsing and manipulating textual data and generating formatted reports. A typical (and tangible) example application for awk is one where a lengthy system log file needs to be examined and summarized into a formatted report. Consider the log files generated by the sendmail daemon or the uucp program. These files are typically lengthy, boring and generally hard on a system administrator's eyes. An awk script can be employed to parse each entry, produce a set of category counts and flag those entries which represent suspicious activity.
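The kind of log summary described above might look something like the following sketch. The log format here (date, level, message) is invented for illustration and is not the real sendmail or uucp format; the array name count is also just an assumption of this example.

```shell
# Hypothetical log lines: "<date> <level> <message>".
log='Jan01 info delivered
Jan01 warn deferred
Jan02 info delivered
Jan02 alert relay-denied'

report=$(printf '%s\n' "$log" | awk '
  { count[$2]++ }                              # tally entries per category (field 2)
  $2 == "alert" { print "SUSPICIOUS: " $0 }    # flag entries that deserve a closer look
  END {
    printf "info=%d warn=%d alert=%d\n", count["info"], count["warn"], count["alert"]
  }')
printf '%s\n' "$report"
```

The flagged entry is printed as it is encountered, and the category counts appear once all input has been read.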
The most significant characteristics of awk are:
It views its input as a set of records and fields.
It offers programming constructs that are similar (but not identical) to the C language.
It offers built-in functions and variables.
Its variables are typeless.
It performs pattern matching through regular expressions.
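A small sketch can make a few of these characteristics concrete. The one-line input and the variable names total and label are assumptions made for this example only.

```shell
# One hypothetical record with three whitespace-separated fields.
out=$(printf 'widget 4 2.50\n' | awk '{
  total = $2 * $3                    # fields arrive as text but can be used as numbers
  label = $1 "-" NF                  # a variable can just as easily hold text; NF is built in
  print label, total, length($1)     # length() is a built-in function
}')
printf '%s\n' "$out"
```

No variable is declared before use, and the same variable could hold a number at one point and a string at another.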
awk scripts can be very expressive and are often several pages in length. The awk language offers the typical programming constructs expected in any high-level programming language. It has been described as an interpreted version of the C language, but although there are similarities, awk differs from C both semantically and syntactically. A host of default behaviors, loose data typing, and built-in functions and variables make awk preferable to C for quick-prototyping tasks.
At least two distinct methods can be used to invoke awk. The first includes the awk script in-line within the command line. The second allows the programmer to save the awk script to a file and refer to it on the command line.
Examine the two invocation styles below, formatted in the typical man page notation.
awk [-Fc] '{ script }' [data-file-list ...]
awk [-Fc] -f script_file [data-file-list ...]
Notice that data-file-list is always optional, since by default awk reads from standard input. I almost always use the second invocation method, since most of my awk scripts are more than 10 lines. As a general rule, it is a good idea to maintain your awk script in a separate file if it is of any significant size. This is a more organized way to maintain source code and allows for separate revision control and readable comment statements. The -F option controls the input field-delimiter character, which I will cover in detail later. The following are all valid examples of invoking awk at a shell prompt:
$ ls -l | awk -f
$ awk -f
$ awk -F: '{ print $2 }'
$ awk '{ print }' input_file
As you will see through examples, awk programming is a process of overriding levels of default actions. The last example above is perhaps the simplest example of invoking awk; it prints each line in the given input file to standard output.
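To compare the two invocation styles side by side, here is a sketch that runs the same one-line script in-line and from a file. The file names data and second.awk are hypothetical and live in a temporary directory created for the example.

```shell
tmp=$(mktemp -d)
printf 'a:b:c\nd:e:f\n' > "$tmp/data"            # hypothetical colon-delimited input
printf '{ print $2 }\n' > "$tmp/second.awk"      # the awk script saved to a file

inline=$(awk -F: '{ print $2 }' "$tmp/data")             # script given in-line
fromfile=$(awk -F: -f "$tmp/second.awk" "$tmp/data")     # script read with -f
piped=$(awk -F: -f "$tmp/second.awk" < "$tmp/data")      # no data file: awk reads stdin

printf '%s\n' "$inline"
rm -r "$tmp"
```

All three invocations produce the same output, the second field of each line.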
If you acquire a thorough understanding of awk's behavior, the complexity of the language syntax won't appear to be so great. To provide a smooth introduction, I will avoid examples that take advantage of regular expressions (see “A Word About Regular Expressions”). awk offers a very well-defined and useful process model. The programmer is able to define groups of actions to occur in sequence before any data processing is performed, while each input record is processed, and after all input data has been processed.
With these groups in mind, the basic syntactical format of any awk script is as follows:
BEGIN { }
{ }
END { }
The code within the BEGIN section is executed by awk before it examines any of its input data. This section can be used to initialize user-defined variables or change the value of a built-in variable. If your script is generating a formatted report, you might want to print out a heading in this section.
The code within the END section is executed by awk after all of its input data has been processed. This section would obviously be suitable for printing report trailers or summaries calculated on the input data. Both the END and BEGIN sections are optional in an awk script.
The middle section is the implicit main input loop of an awk script. This section must contain at least one explicit action. That action can be as simple as an unconditional print statement. The code in this section is executed each time a record is encountered in the input data set. By default, the record delimiter is a line-feed character, so a record is a single line of text. The programmer can redefine the record delimiter by assigning a new value to the built-in variable RS.
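The order in which the three sections run can be seen in a minimal sketch like this one. The three-line input and the counter n are assumptions made for the example.

```shell
out=$(printf 'one\ntwo\nthree\n' | awk '
BEGIN { print "start" }             # runs once, before any input is read
      { n++; print n ": " $0 }      # runs once for each input record
END   { print "records: " n }       # runs once, after the last record
')
printf '%s\n' "$out"
```

The heading appears first, each record is then numbered and echoed, and the trailer reports how many records were seen.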
The following input data text will be assumed in each of the following examples. The content of the data is somewhat silly, but serves the exercise well. You can imagine it representing a produce inventory; each line defines a produce category, a particular item and an item count.
fruit: oranges 10
fruit: peaches 11
fruit: plums 11
vegetable: cucumbers 8
vegetable: carrots
fruit: tomatoes 2
We will start off very simply and quickly work into something non-trivial. Notice that I make a habit of always defining each of the three sections, even if the optional sections are stubbed out. This serves as a good visual placeholder and reminds the programmer of the entire process model even if certain sections are not currently useful. Be aware that each of the examples could be collapsed into shorter scripts without any loss of functionality. My intent here is to demonstrate as many awk features as possible through these few examples.
Look at the example script in Listing 1 and try to relate it to its output:
fruit: oranges 10
fruit: peaches 11
fruit: plums 11
fruit: tomatoes 2
By default, an input record is a line-feed terminated section of text, so if the input contains six lines, the implicit main loop marked by the # (1) comment executes six times. The awk source-code comments are specified with a # character; the interpreter ignores characters from the # to the end of the line (the same comment style as the UNIX shell). The built-in variable $0 always contains the entire current record value (see built-in variable table below). The line below the (1) marker checks to see if the current input record is an empty line. If it is, awk goes on to read the next input record. Each field within a record is assigned to an ordered variable, $1 through $N, where N is equal to the number of fields in the current record. What determines a field? Well, the default field separator is any “white space”, that is, a space or tab character. The field separator character can be redefined. The line below the # (2) comment will print out the entire record if the first field is set to fruit:. So, when looking at the output produced by Listing 1, all lines of type fruit are displayed.
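A script along the lines just described could be sketched as follows. This is a reconstruction from the description, not necessarily the published listing, and it feeds the sample inventory to awk through a shell variable rather than a file.

```shell
# The sample inventory, one record per line.
data='fruit: oranges 10
fruit: peaches 11
fruit: plums 11
vegetable: cucumbers 8
vegetable: carrots
fruit: tomatoes 2'

out=$(printf '%s\n' "$data" | awk '
BEGIN { }                        # nothing to set up yet
{
  # (1) skip empty records
  if ($0 == "") next
  # (2) print the whole record when the first field is "fruit:"
  if ($1 == "fruit:") print $0
}
END { }                          # nothing to summarize yet
')
printf '%s\n' "$out"
```

Only the four records whose first field is fruit: are printed.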
Take a look at the example script in Listing 2 and try to relate it to its output below. The only noticeable enhancement is the data summary at the end, stating how many of the total units were of type fruit.
fruit: oranges 10
fruit: peaches 11
fruit: plums 11
fruit: tomatoes 2
4 out of 5 entries were of type fruit:.
This time, we made use of the two optional BEGIN and END sections of the awk script. The group of statements preceded by the # (1) comment initialize some programmer-defined variables: FCOUNT, COUNT and TYPE, representing the number of fruit: records encountered, the total number of records and the produce-category name. Notice that the line preceded by the # (3) comment unconditionally increments the record counter (also note that the syntax is borrowed from the C language). The section of code preceded by the # (4) comment now references the TYPE variable instead of a literal string, and increments the FCOUNT variable. The next section of code makes use of the printf built-in function (which works just as the C-library printf does, but differs a bit syntactically) to print out a sub-count and a total count.
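The structure described here can be sketched as shown below. It is a reconstruction from the description rather than the published listing; the comment numbering follows the text, and the assumption is that the record counter is incremented for every non-empty record, so the total it reports depends on the input it is given.

```shell
# The sample inventory, one record per line.
data='fruit: oranges 10
fruit: peaches 11
fruit: plums 11
vegetable: cucumbers 8
vegetable: carrots
fruit: tomatoes 2'

out=$(printf '%s\n' "$data" | awk '
BEGIN {
  # (1) programmer-defined variables
  FCOUNT = 0; COUNT = 0; TYPE = "fruit:"
}
{
  # (2) skip empty records
  if ($0 == "") next
  # (3) count every remaining record
  COUNT++
  # (4) print and count the records of the chosen type
  if ($1 == TYPE) { print $0; FCOUNT++ }
}
END {
  printf("%d out of %d entries were of type %s.\n", FCOUNT, COUNT, TYPE)
}')
printf '%s\n' "$out"
```

Changing the value assigned to TYPE in the BEGIN section is enough to report on a different category.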
Look at the example script in Listing 3 and try to relate it to its output. Notice that the only records displayed are those which were flagged as an error and those indicating a supply shortage. The summarization at the end of the output now includes additional information. Output from Listing 3:
Parsing inventory file "input_data"
Bad data encountered: vegetable: carrots
Short on tomatoes: 2 left
4 out of 5 entries were of type Fruit.
1 out of 5 entries were of type Vegetable.
0 out of 5 entries were of type Other.
1 out of 5 entries were flagged as bad data.
1 out of 5 entries were flagged in short supply
In this third example, we make further use of the two optional BEGIN and END sections. Once again, the BEGIN section initializes some programmer-defined variables. It also prints out a heading that indicates the name of the input file (the built-in variable FILENAME is referenced). Notice the new code section preceded by the # (3) comment. The NF variable is a built-in that always contains the number of fields contained in the current record. Since white space is still our field delimiter, we would always expect three fields. This code section flags and displays a record that is deemed bad data. Also, a counter maintaining the number of errors is incremented. Since records deemed invalid are useless, the program then goes on to process the next input record. The code section preceded by the # (5) comment was altered to maintain additional counts based on the produce category type.
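The two checks that produce the flagged lines could be sketched roughly as follows. This covers only the bad-data and short-supply logic, not the full report. The variable names ERRORS, SHORT and LOW are assumptions, and the threshold of 5 is a guess; the text does not say what count qualifies as a shortage.

```shell
# The sample inventory, one record per line.
data='fruit: oranges 10
fruit: peaches 11
fruit: plums 11
vegetable: cucumbers 8
vegetable: carrots
fruit: tomatoes 2'

out=$(printf '%s\n' "$data" | awk '
BEGIN { ERRORS = 0; SHORT = 0; LOW = 5 }    # LOW is an assumed threshold
{
  if ($0 == "") next
  # flag records that do not have exactly three fields, then move on
  if (NF != 3) {
    print "Bad data encountered: " $0
    ERRORS++
    next
  }
  # flag items whose count (field 3) is below the assumed threshold
  if ($3 < LOW) {
    print "Short on " $2 ": " $3 " left"
    SHORT++
  }
}
END { printf("%d bad, %d short\n", ERRORS, SHORT) }')
printf '%s\n' "$out"
```

Because of the next statement, a record that fails the field-count check never reaches the supply check.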
Now let's assume a system administrator is asked to determine the proportions in which certain shell interpreters are being used, the choices being the standard Bourne Shell, the Korn Shell and the C Shell. The script will provide a breakdown of usage by total count and percentages and flag the instances where a login shell was not applicable or not assigned to a system user. Examine the script in Listing 4; it satisfies our requirement. Relate the code to its output in Listing 5.
The first thing worth noticing in the Listing 4 script is the assignment to the built-in variable FS—the input field delimiter. Entries in the /etc/passwd file are made up of colon separated fields. Field 7 indicates which program (shell) is run on behalf of that user at login time. Entries with an empty field 7 are printed out, then the summary report is printed.
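A much-reduced sketch of the same idea follows. It is not Listing 4; the passwd-style entries are invented so that the example is self-contained, and the array names shells and order are assumptions of this sketch. Against a real system, the input would be /etc/passwd.

```shell
# Hypothetical passwd-style entries; field 7 is the login shell.
passwd='root:x:0:0:root:/root:/bin/sh
alice:x:1000:1000::/home/alice:/bin/ksh
bob:x:1001:1001::/home/bob:/bin/csh
carol:x:1002:1002::/home/carol:/bin/sh
daemon:x:1:1::/:'

out=$(printf '%s\n' "$passwd" | awk '
BEGIN { FS = ":" }                                # fields are colon separated
$7 == "" { print "No login shell: " $1; next }    # flag entries with an empty field 7
{ total++; shells[$7]++ }                         # count users per shell
END {
  # report the three shells in a fixed order
  n = split("/bin/sh /bin/ksh /bin/csh", order, " ")
  for (i = 1; i <= n; i++)
    printf("%s %d %.0f%%\n", order[i], shells[order[i]], 100 * shells[order[i]] / total)
}')
printf '%s\n' "$out"
```

Percentages here are taken over the entries that have a login shell; a real report might reasonably choose a different base.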
Thus far, we have reviewed awk's behavior through several small examples of code. The features demonstrated provide a working foundation. You have seen the execution flow of an awk process. You have seen built-in and user-defined variables being manipulated. And you have seen a few built-in awk functions applied. As with any high-level language, one can be very creative with awk. Once you get comfortable, you will want to put it to more sophisticated use. Most Linux systems today offer the features of nawk (new awk), which was developed in the late 1980s. nawk and GNU's gawk make it possible to do the following within an awk script:
Include programmer-defined functions.
Execute external programs and process the results.
Manipulate command line arguments more easily.
Manage multiple I/O streams.
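Three of these features fit into one short sketch, assuming an awk with the nawk feature set. The function name twice and the argument extra_arg are made up for the example.

```shell
out=$(awk '
function twice(x) { return 2 * x }     # programmer-defined function
BEGIN {
  "echo 21" | getline n                 # run an external program and read its output
  close("echo 21")
  print twice(n), ARGV[1]               # ARGV holds the command line arguments
}' extra_arg)
printf '%s\n' "$out"
```

Because the script consists of a BEGIN section only, awk never tries to open extra_arg as an input file.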
As a reference, Tables 1 and 2 define the most common built-in variables and functions. Also, note that the following operators each have the same meaning in awk as they do in C (refer to the awk man page):
* / % + - = ++ -- += -= *= /= %=
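A quick sketch exercising several of these operators is shown below. One difference from C worth keeping in mind is that awk arithmetic is floating point, so dividing two whole numbers does not truncate the result.

```shell
out=$(awk 'BEGIN {
  n = 7
  n += 3        # n is now 10
  n *= 2        # n is now 20
  n %= 6        # n is now 2
  n++           # n is now 3
  print n, 7 / 2, 7 % 2
}')
printf '%s\n' "$out"
```

The second value printed is 3.5, where integer division in C would give 3.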
Scripting languages and specialty tools that allow rapid development have been widely accepted for quite some time. Both awk and sed deserve a spot on any Linux developer's and administrator's workbench. Both tools are a standard part of any Linux platform. Together, awk and sed can be used to implement virtually any text filtering application, such as performing repetitive edits to data streams and files and generating formatted reports.
The most current reference book for both awk and sed is the O'Reilly release sed & awk by Dale Dougherty and Arnold Robbins. Also see Effective AWK Programming by Arnold Robbins (SSC). For an immediate on-line synopsis on your Linux system, use the man command, for example man awk or man sed.
I hope the information provided here is useful and encourages you to begin or expand your use of these tools. If you exploit what awk and sed offer, you will most certainly save development time and money. Those who know how to quickly apply sharp tools to seemingly complex problems are handsomely rewarded in our field.
Louis Iacona (lji@peakaccess.com) has been designing and developing software systems since 1982 on UNIX/Linux and other platforms. Most recently, his efforts have focused on applying World Wide Web technologies to client/server development projects. He currently manages the Internet development group at ICP in Staten Island, N.Y.