AWK is a standard part of UNIX and Unix clones. Over the years, no fewer than five different C-language AWK interpreters, a Java-language AWK interpreter, and two AWK compilers have been written. Here is a chart comparing the five C-language AWK interpreters:
Name | Primary Maintainer | Last release |
Original AWK | Brian Kernighan | April 24, 2005 |
Mawk | Michael D. Brennan | September 1996 |
Gawk | Arnold D. Robbins | July 26, 2005 |
Busybox AWK | Dmitry Zakharov | October 29, 2006 |
MKS AWK | The OpenSolaris team | October 2006 |
All five of these AWK implementations are open source, and can be freely downloaded here. Just click on the name to download the file in question. Note that the AWK in Busybox is just a small part of a much bigger toolkit for embedded systems; when compiled, it produces the smallest binary of the five AWKs above. Gawk produces the largest binary.
The original AWK is just that: The very first Awk implementation; this implementation has code going back to the late 1970s. It has been updated to compile and run on modern systems, such as Linux, Windows, and FreeBSD. This version of AWK became open-source code in 1996. This is the default AWK that comes with FreeBSD.
The next AWK implementation to be written was MKS AWK. This implementation dates back to the mid-1980s. Its source code recently became public when OpenSolaris was released. OpenSolaris uses both this version of AWK and earlier versions of the original AWK. I have ported this version to Linux; click on "MKS AWK" to download the Linux port.
The next AWK implementations to be written were the open-source Gawk and Mawk. Both the Free Software Foundation and Michael Brennan wanted a free version of AWK in the late 1980s and early 1990s; neither was aware of the other's work, so these two independent free implementations of AWK were made around the same time. Mawk is the default AWK that comes with Debian and Debian-derived distributions, such as Ubuntu. Gawk is the default AWK that comes with most other Linux distributions.
The most recent of the five is the AWK that comes with Busybox. Busybox is a project to make the standard UNIX tools available using as little memory and disk space as possible. Dmitry Zakharov implemented AWK for Busybox starting in 2002. While earlier versions had a number of bugs and incompatibilities, Mr. Zakharov has been actively maintaining this version of AWK; more recent versions are both POSIX-compliant and able to run legacy AWK scripts. This is, not surprisingly, the smallest AWK implementation.
There are some other interesting AWK implementations out there: xgawk extends AWK with database connectivity and other features missing from the traditional AWKs. awka is a project that allows one to make C programs from AWK scripts. jawk is a project to implement AWK in Java.
There is a perception that POSIX requires these regular expressions to break in non-C/English locales. This is not true; the standard merely states that ranges may break. E.g., POSIX 9.3.5, item 7: "In other [non-POSIX] locales, a range expression has unspecified behavior". (The "POSIX" locale is also known as the "C" locale.)
Dr. Kernighan (the "K" in AWK), when dealing with this issue, said:
"strcoll is meant for sorting, where merging upper and lower case may make sense (though note that unix sort does not do this by default either). it is not appropriate for regular expressions"
(See the "FIXES" file included with his implementation of AWK.)
Since POSIX allows an internationalized AWK to remain compatible with legacy AWK scripts, and one of the three original implementors of AWK feels that such scripts must not be broken, I have a patch for Gawk that fixes this problem. This patch maximizes compatibility and minimizes the number of scripts that will break; ranges with non-ASCII characters still have the locale-aware behavior.
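Independently of any patch, a script can sidestep the unspecified behavior by pinning the POSIX locale for the awk invocation, since that is the one locale where the standard defines what a range means. The following is only a minimal sketch of that idea; the function name match_lower is made up for illustration, and it assumes whatever awk is first on the PATH.

```shell
# match_lower (hypothetical helper): print only the input lines that consist
# entirely of characters in the range [a-z].
# LC_ALL=C selects the POSIX locale for this one awk process, so the range
# is interpreted by byte value rather than by the locale's collation order.
match_lower() {
    LC_ALL=C awk '/^[a-z]+$/'
}

# The all-uppercase line is filtered out; "abc" and "xyz" are printed.
printf 'abc\nABC\nxyz\n' | match_lower
```

The trade-off is that in the C locale non-ASCII characters are treated as plain bytes, so this only suits scripts whose ranges are meant to cover ASCII.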
POSIX character classes seem to be pretty rare in AWK scripts; there is only one bug reported in the Ubuntu bug database where someone had a problem with this.
I have a patch that adds POSIX character class support to Mawk. Considering Debian's speed of development and Ubuntu's seeming lack of interest in updating their core utilities, it will probably take years for this patch to become a part of these distributions.
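To see whether a particular awk binary already understands POSIX character classes, a quick probe along the following lines may help. It is a sketch, not part of any of the patches discussed here; the function name has_char_classes is hypothetical, and it relies on the assumption that an engine without class support reads [[:digit:]] as an ordinary bracket expression followed by a literal "]", which does not match a lone digit.

```shell
# has_char_classes (hypothetical helper): report whether the awk named in $1
# (default: awk) treats [[:digit:]] as a POSIX character class.
has_char_classes() {
    awk_bin=${1:-awk}
    # Feed a single digit; exit status 0 only if the class pattern matched it.
    if printf '7\n' | "$awk_bin" '/^[[:digit:]]$/ { found = 1 } END { exit !found }' 2>/dev/null
    then
        echo "supported"
    else
        echo "not supported"
    fi
}

# Probe the default awk; pass another name or path to probe a different one.
has_char_classes
```

Passing a different binary as the first argument makes it easy to compare the stock awk with a patched build.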
Indeed, I am not the first person to try to update Mawk's regex engine. Aleksey Cheusov, in the summer of 2005, patched Mawk to use an external regular expression engine. I have made a copy of the patch which people can download. His approach is different; instead of adding features to Mawk's own regular expression engine, he simply has Mawk use an external engine.
His patch allows a variety of external regular expression engines to be used. For example, to use libc's regex engine:
./configure && make
Or, to use the "tre" regex engine:
CFLAGS='-O3 -I/usr/include/tre' LDFLAGS='-ltre' ./configure && make
Note that, after applying his patch, autoconf needs to be run to create a new 'configure' script.