GNU Awk 4.1: Teaching an Old Bird Some New Tricks, Part II

on October 28, 2013

In an earlier article ("GNU Awk 4.0: Teaching an Old Bird Some New Tricks", published in the September 2011 issue of Linux Journal), I gave a brief history of awk and gawk and provided a high-level overview of the many new features in gawk 4.0. I recommend reading that article first, although you can read this one without doing so, if you wish.

gawk 4.0 itself was released in June 2011. Since then, the gawk development team has not been resting on its laurels! gawk 4.1, released in May 2013, contains a number of new features, and that's what I cover here.

Compared with gawk 4.0, gawk 4.1 has considerably fewer changes at the language level (although there are some). The changes this time around are more concerned with internals and with the ability to interface to the outside world. So let's get started.

Reduced Footprint

For many years, when you built gawk, you got two executables: the regular interpreter, gawk, and pgawk, its profiling twin brother, which ran awk programs (more slowly) and produced a statement count execution profile showing how many times each line of code was executed.

With gawk 4.0, you got an additional executable, dgawk, the gawk debugger. Although the three versions shared most of the same code, the core parts that actually executed your awk program were compiled differently in each one.

For gawk 4.1, all three executables have been merged into a single program, named just gawk. Although the combined executable is larger, it is still smaller than having three separate executables, and in addition, the documentation is simpler and easier to understand (and maintain!).

To accommodate this change, the options had to change slightly. You now use -D to run the debugger, -p to do profiling and -o for pretty-printing without profiling.

Arbitrary Precision Arithmetic with MPFR and GMP

An important new feature that is visible to the awk programmer is arbitrary precision floating-point arithmetic with the GNU MPFR and GMP libraries.

This is an optional feature: if you have the MPFR and GMP libraries installed when you configure and build gawk, gawk automatically will be able to use them.

Note that I said "be able to use them". You still have to choose to do so by using the -M option (or --bignum, if you prefer long options). Once you have, you can set the special variable PREC to the desired floating-point precision.

The precision is the number of bits kept in the floating-point mantissa. The default is 53, which is the same as that used by hardware double-precision floating point. From the gawk manual:


$ gawk -M -v PREC=100 'BEGIN { x = 1.0e-400; print x + 0
> PREC = "double"; print x + 0 }'
1e-400
0

You see that regular hardware can't handle an exponent of -400, whereas MPFR can.

An additional new variable, ROUNDMODE, sets the rounding mode for calculations and printing arbitrary precision values.

In the past several years, for reasons I don't quite understand, I've gotten bug reports from people who expect gawk's arithmetic to work exactly like "real" arithmetic done with pencil and paper. In other words, they want what is known in Computer Science as decimal arithmetic. I'm not sure why they expect this, but as we all should know, computers don't quite work that way.

MPFR does not give you decimal arithmetic. However, if you understand what you're doing and how to use it, you can get results that are likely to be good enough for your purposes.

The manual has a full chapter that describes the issues involved with floating-point arithmetic, what it means when you increase the precision, and how to use the various rounding modes supported by MPFR.

New Arrays Provide Indirect Variable Access

There are three new arrays:

SYMTAB: provides access to awk-level variables.
FUNCTAB: lists the names of all user-defined and extension functions.
PROCINFO["identifiers"]: lists all known identifiers and what gawk knows about their types after it has parsed the program.

Of these, SYMTAB is the most interesting, since it provides indirect access to any variable. For example:


$ gawk 'BEGIN { a = 5 ; print "a =", a
> SYMTAB["a"] += 37
> print "a is now", a }'
a = 5
a is now 42

With the isarray() built-in function, you can "walk" the entire symbol table and print out all variable and array values, if you choose to do so.

Dynamic Extensions

The most exciting change in gawk 4.1 is its ability to interface to the outside world. For many years, gawk had an "extension" or "plug in" mechanism that let a programmer write a new "built-in" function in C, and load it into the running gawk interpreter at runtime.

This mechanism required understanding something of the gawk internals and making use of gawk's internal data structures and functions. Although it was documented minimally, and it worked, it had several drawbacks. The most notable one was that there was no backward compatibility across releases.

Nonetheless, a group of developers forked gawk to create xgawk (XML gawk) and developed a number of dynamic extensions and new facilities for the core executable.

For many years, I had been wanting to provide a defined C API for writing extensions that would not be dependent upon the gawk internals and that possibly could provide binary compatibility across releases.

For gawk 4.1, together with the xgawk developers, we finally made this happen.

Why Do You Need Extensions?

Consider this: an awk program cannot even change its working directory with the chdir system call! awk is thus a closed language—one that provides you with only the facilities that the implementors chose to provide and no more. That's not much fun. (Well, awk is fun, but it's still limited.)

By contrast, modern scripting languages are all open and extensible; Perl, Tcl, Python and Ruby all have thousands of available modules that can be loaded at runtime. It's past time that gawk could do that too.

What You Can Do from an Extension

It is best to think of extension functions as user-defined functions written in another language. They cannot do everything a user-defined function can (such as call an awk function, manipulate the fields, read records with getline and so on), but what they can do is enough to make gawk more open, and let it interface with the underlying operating system and with other C (or C++) libraries. In particular, you can:

Pass scalars by value and arrays by reference.
Create and modify new global variables and arrays.
Access the built-in variables (read-only, although you can update PROCINFO).
Register a function to be called when gawk exits.
Print warning and/or fatal error messages.
Update the built-in variable ERRNO for when something goes wrong.
Hook into the I/O redirection mechanisms, providing your own "special" filenames and/or two-way communicators.
And of course, register new functions that can be called from gawk.

The API provides a number of data types to make it easier to communicate with gawk. For example, gawk strings can contain embedded NUL characters (all bits zero), so strings have a pointer and a length. gawk maintains reference-counted strings internally, so there are ways to tell gawk to reuse a value it already knows about.

In addition, the API lets you "flatten" awk's associative arrays into an array of structs for easy iteration in C code, without having to call into gawk each time you want to move to the next element in an array.

A full description of the API is beyond the scope of this article; however, the manual includes a full chapter, with examples, describing the API and showing how to use it.

OS Independence

The extension mechanism has been designed to work on multiple operating systems. At the time of this writing, it works on any *nix system that supports the POSIX dlopen() API. This includes Mac OS X. The basic mechanism also works on Microsoft Windows using MinGW. However, support to build the sample extensions was not included in the 4.1 release since it was not ready. This support will be included in the first patch release, whenever that will be, although not all of the sample extensions can work on Windows.

Sample Extensions

The gawk distribution provides a number of small, sample extensions. Their main purpose is to serve as examples of how to use the API, but nonetheless they should be usable for real work also. The full list is documented in the manual. Some of the more interesting ones are:

The "filefuncs" extension, which provides chdir() and stat() functions, and also an interface to the fts(3) suite of routines for walking a file hierarchy.
The "fnmatch" extension, which provides an awk version of the fnmatch(3) suite.
The "readdir" extension, which returns records for the contents of directories named on the gawk command line or read with getline. (Without the extension, gawk treats an attempt to read a directory as a nonfatal error. With other awks, it's fatal.)
The "inplace" extension, which simulates the GNU sed -i feature for in-place editing of command-line data files.

Additional, more specialized extensions illustrate the use of parts of the API not covered by the extensions just listed.

The gawkextlib Project

Now that gawk supports the major xgawk features, the xgawk developers have reoriented their project around their specific extensions. It no longer includes the forked gawk code base. To emphasize this change in orientation, they renamed their project "gawkextlib".

It is their (and my) hope that this project can serve as a central clearinghouse for new gawk extensions that may be written by the awk community over time.

The gawkextlib project currently has four extensions:

The XML extension, which adds several new variables and an input parser, letting gawk parse XML files in a natural fashion. This extension is built on top of the Expat XML parser. This is a powerful extension; instead of having to try to parse XML files with regular expressions manually, the Expat parser does it for you, including all the icky validation stuff that would be really hard to do in straight awk code.
The PostgreSQL extension, which provides functions for talking to PostgreSQL databases.
The GD graphics library extension, for use with the GD graphics library (see Resources).
The MPFR library extension. This extension gives you access to a number of MPFR functions that are not accessible from gawk's built-in MPFR support.

The Future

I feel that gawk as a language has largely reached maturity, and do not wish to add too many more features. That said, there are a few items still open for exploration:

Additional numeric facilities, such as possible integration with a decimal arithmetic library.
A way to map gawk arrays onto external storage, such as DBM arrays or SQL databases.
A "namespace" facility for extension functions and variables, and possibly regular gawk-level variables and functions as well. This would be a major design activity.

Of course, describing the above items does not constitute a commitment to do any of them.

Conclusion

The new API and extension facility open new horizons for gawk and for awk programmers. I am very excited about it, and I hope to see gawk used for many new things where it simply was not applicable before.

Acknowledgements

Thanks to Scott Deifik, Dr Brian W. Kernighan, Dr Nelson Beebe and Eli Zaretskii for comments on the initial draft of this article.

The entire gawk development team deserves kudos for their work on this release. It was very much a team effort.

Resources

"GNU Awk 4.0: Teaching an Old Bird Some New Tricks", LJ, September 2011: http://www.linuxjournaldigital.com/linuxjournal/201109#pg94

The gawk distribution: http://ftp.gnu.org/gnu/gawk/gawk-4.1.0.tar.gz

Documentation On-line: http://www.gnu.org/software/gawk/manual

Arbitrary Precision Arithmetic with gawk: http://www.gnu.org/software/gawk/manual/html_node/Arbitrary-Precision-Arithmetic.html#Arbitrary-Precision-Arithmetic

Dynamic Extensions: http://www.gnu.org/software/gawk/manual/html_node/Dynamic-Extensions.html#Dynamic-Extensions