GNU Awk 4.1: Teaching an Old Bird Some New Tricks, Part II
In an earlier article ("GNU Awk 4.0: Teaching an Old Bird Some New
Tricks", published in the September 2011 issue of Linux Journal), I
gave a brief history of awk and gawk and provided a high-level overview
of the many new features in gawk 4.0. I recommend reading that article
first, although you can read this one without doing so, if you wish.
gawk 4.0 itself was released in June 2011. Since then, the gawk
development team has not been resting on its laurels! gawk 4.1, released
in May 2013, contains a number of new features, and that's what I
cover here.
Compared with gawk 4.0, this release has considerably fewer changes at
the language level (although there are some). The changes this time
around are more concerned with internals, and with the ability to
interface to the outside world. So let's get started.
For many years, when you built gawk, you got two executables: the regular
interpreter, gawk, and pgawk, its profiling twin brother, which ran awk
programs (more slowly) and produced a statement count execution profile
showing how many times each line of code was executed.
With gawk 4.0, you got an additional executable, dgawk, the gawk debugger.
Although the three versions shared most of the same code, the core parts
that actually executed your awk program were compiled differently in each
one.
For gawk 4.1, all three executables have been merged into a single
program, named just gawk. Although the combined executable is larger
than any one of its predecessors, it is still smaller than the three
separate executables taken together, and in addition, the documentation
is simpler and easier to understand (and maintain!).
To accommodate this change, the options had to change slightly.
You now use -D to run the debugger, -p to do profiling and -o
for pretty-printing without profiling.
An important new feature that is visible to the awk programmer is
arbitrary precision floating-point arithmetic with the GNU MPFR and GMP
libraries.
This is an optional feature: if you have the MPFR and GMP libraries
installed when you configure and build gawk, gawk automatically
will be able to use them.
Note that I said "be able to use them". You still have to
choose to do so by using the -M option (or --bignum, if you prefer long
options). With that option in effect, you set the special variable PREC
to the desired floating-point precision.
The precision is the number of bits kept in the floating-point mantissa.
The default is 53, which is the same as that used by hardware double-precision floating point. From the gawk manual:
$ gawk -M -v PREC=100 'BEGIN { x = 1.0e-400; print x + 0
> PREC = "double"; print x + 0 }'
1e-400
0
You see that regular hardware can't handle an exponent of -400, whereas MPFR can.
An additional new variable, ROUNDMODE, sets the rounding mode for
calculations and printing arbitrary precision values.
In the past several years, for reasons I don't quite understand,
I've gotten bug reports from people who expect gawk's arithmetic
to work exactly like "real" arithmetic done with pencil and paper.
In other words, they want what is known in Computer Science as
decimal arithmetic. I'm not sure why they expect this, but as we all
should know, computers don't quite work that way.
MPFR does not give you decimal arithmetic. However, if you understand what you're doing and how to use it, you can get results that are likely to be good enough for your purposes.
The manual has a full chapter that describes the issues involved with floating-point arithmetic, what it means when you increase the precision, and how to use the various rounding modes supported by MPFR.
New Arrays Provide Indirect Variable Access
There are three new arrays:
- SYMTAB: provides access to awk-level variables.
- FUNCTAB: lists the names of all user-defined and extension functions.
- PROCINFO["identifiers"]: lists all known identifiers and what gawk knows about their types after it has parsed the program.
Of these, SYMTAB is the most interesting, since it provides indirect access
to any variable. For example:
$ gawk 'BEGIN { a = 5 ; print "a =", a
> SYMTAB["a"] += 37
> print "a is now", a }'
a = 5
a is now 42
With the isarray() built-in function, you can "walk" the entire symbol
table and print out all variable and array values, if you choose to do so.
The most exciting change in gawk 4.1 is its ability to interface to the
outside world. For many years, gawk had an "extension" or "plug in"
mechanism that let a programmer write a new "built-in" function in C,
and load it into the running gawk interpreter at runtime.
This mechanism required understanding something of the gawk internals
and making use of gawk's internal data structures and functions. Although
it was documented minimally, and it worked, it had several drawbacks.
The most notable one was that there was no backward compatibility across
releases.
Nonetheless, a group of developers forked gawk to create xgawk (XML gawk)
and developed a number of dynamic extensions and new facilities for the
core executable.
For many years, I had been wanting to provide a defined C API for
writing extensions that would not be dependent upon the gawk internals
and that possibly could provide binary compatibility across releases.
For gawk 4.1, together with the xgawk developers, we finally made this
happen.
Why Do You Need Extensions?
Consider this: an awk program cannot even change its working directory with
the chdir system call! awk is thus a closed language: one that provides
you with only the facilities that the implementors chose to provide and
no more. That's not much fun. (Well, awk is fun, but it's still limited.)
By contrast, modern scripting languages are all open and extensible;
Perl, Tcl, Python and Ruby all have thousands of available modules that can
be loaded at runtime. It's past time that gawk could do that too.
What You Can Do from an Extension
It is best to think of extension functions as user-defined functions
written in another language. They cannot do everything a user-defined
function can (such as call an awk function, manipulate the fields, read
records with getline and so on), but what they can do is enough to make
gawk more open, and let it interface with the underlying operating system
and with other C (or C++) libraries. In particular, you can:
- Pass scalars by value and arrays by reference.
- Create and modify new global variables and arrays.
- Access the built-in variables (read-only, although you can update PROCINFO).
- Register a function to be called when gawk exits.
- Print warning and/or fatal error messages.
- Update the built-in variable ERRNO for when something goes wrong.
- Hook into the I/O redirection mechanisms, providing your own "special" filenames and/or two-way communicators.
- And of course, register new functions that can be called from gawk.
The API provides a number of data types to make it easier to communicate
with gawk. For example, gawk strings can contain embedded NUL characters
(all bits zero), so strings have a pointer and a length. gawk maintains
reference-counted strings internally, so there are ways to tell
gawk to reuse a value it already knows about.
In addition, the API lets you "flatten" awk's associative arrays into
an array of structs for easy iteration in C code, without having to call
into gawk each time you want to move to the next element in an array.
A full description of the API is beyond the scope of this article; however, the manual includes a full chapter, with examples, describing the API and showing how to use it.
OS Independence
The extension mechanism has been designed to work on multiple operating
systems. At the time of this writing, it works on any *nix system that
supports the POSIX dlopen() API. This includes Mac OS X. The basic mechanism also
works on Microsoft Windows using MinGW. However, support to build
the sample extensions was not included in the 4.1 release since it was
not ready. This support will be included in the first patch release,
whenever that will be, although not all of the sample extensions can work on
Windows.
Sample Extensions
The gawk distribution provides a number of small, sample extensions.
Their main purpose is to serve as examples of how to use the API, but
nonetheless they should be usable for real work also. The full list is
documented in the manual. Some of the more interesting ones are:
- The "filefuncs" extension, which provides chdir() and stat() functions, and also an interface to the fts(3) suite of routines for walking a file hierarchy.
- The "fnmatch" extension, which provides an awk version of the fnmatch(3) suite.
- The "readdir" extension, which returns records for the contents of directories named on the gawk command line or read with getline. (Normally, it's a nonfatal error to try to read a directory. With other awks, it's fatal.)
- The "inplace" extension, which simulates the GNU sed -i feature for in-place editing of command-line data files.
Additional, more specialized extensions illustrate the use of parts of the API not covered by the extensions just listed.
The gawkextlib Project
Now that gawk supports the major xgawk features, the xgawk developers
have reoriented their project around their specific extensions. It no
longer includes the forked gawk code base. To emphasize this change in
orientation, they renamed their project "gawkextlib".
It is their (and my) hope that this project can serve as a central
clearinghouse for new gawk extensions that may be written
by the awk community over time.
The gawkextlib project currently has four extensions:
- The XML extension, which adds several new variables and an input parser, letting gawk parse XML files in a natural fashion. This extension is built on top of the Expat XML parser. This is a powerful extension; instead of having to try to parse XML files with regular expressions manually, the Expat parser does it for you, including all the icky validation stuff that would be really hard to do in straight awk code.
- The PostgreSQL extension, which provides functions for talking to PostgreSQL databases.
- The GD graphics library extension, for use with the GD graphics library (see Resources).
- The MPFR library extension. This extension gives you access to a number of MPFR functions that are not accessible from gawk's built-in MPFR support.
I feel that gawk as a language has largely reached maturity, and do
not wish to add too many more features. That said, there are a few
items still open for exploration:
- Additional numeric facilities, such as possible integration with a decimal arithmetic library.
- A way to map gawk arrays onto external storage, such as DBM arrays or SQL databases.
- A "namespace" facility for extension functions and variables, and possibly regular gawk-level variables and functions as well. This would be a major design activity.
Of course, describing the above items does not constitute a commitment to do any of them.
Conclusion
The new API and extension facility open new horizons for gawk and
for awk programmers. I am very excited about it, and I hope to see
gawk used for many new things where it simply was not applicable before.
Thanks to Scott Deifik, Dr Brian W. Kernighan, Dr Nelson Beebe and Eli Zaretskii for comments on the initial draft of this article.
The entire gawk development team deserves kudos for their work on
this release. It was very much a team effort.
Resources
"GNU Awk 4.0: Teaching an Old Bird Some New Tricks", LJ, September 2011: http://www.linuxjournaldigital.com/linuxjournal/201109#pg94
The gawk distribution: http://ftp.gnu.org/gnu/gawk/gawk-4.1.0.tar.gz
Documentation On-line: http://www.gnu.org/software/gawk/manual
Arbitrary Precision Arithmetic with gawk: http://www.gnu.org/software/gawk/manual/html_node/Arbitrary-Precision-Arithmetic.html#Arbitrary-Precision-Arithmetic
Dynamic Extensions: http://www.gnu.org/software/gawk/manual/html_node/Dynamic-Extensions.html#Dynamic-Extensions
gawkextlib Home Page: http://gawkextlib.sourceforge.net
gawkextlib Download: http://sourceforge.net/projects/gawkextlib
The GD Graphics Library: http://www.boutell.com/gd/manual2.0.33.html
The Expat XML Parser: http://expat.sourceforge.net