Saturday, December 13, 2008

Learn to Talk AWK

When it comes to slicing and dicing text, few tools are as powerful, or as underutilized, as awk. The name AWK was coined from the initials of its authors, Aho, Weinberger, and Kernighan. Yes, the same Kernighan of the famous K&R C Programming Language book. In the Linux world, every distribution includes the GNU version, gawk (/bin/awk is usually a symbolic link to /bin/gawk). OS X ships with the BSD version of awk, more closely related to the original UNIX awk. The focus of this article is on the core features common to POSIX-compliant awks.

note: originally published January 16, 2006 on linux.com
updated for OS X on October 2, 2007

Basic command line usage


The awk utility is a small program that executes awk language scripts, often one-liners, though it handles larger programs saved in a text file just as adeptly. For example, to execute an awk script saved in the file prg1.awk, and have it process the file data1, you could use this command:

awk -f prg1.awk data1

The result is written to standard out, so it is usually redirected to a file.
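
For example, to save the results in a file (the name results.txt is just for illustration):

awk -f prg1.awk data1 > results.txt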

Another commonly used command line parameter is -F, which changes the default field separator (whitespace). The field separator can also be changed from within an awk program. To tell awk how to split data into fields from a comma-separated values (CSV) file, you would use:

awk -F"," -f prg1.awk data1

You may also include more than one data file to process, and awk will keep running until it runs out of data:

awk -F"," -f prg1.awk data1 data2 data3 data4 data5

If you want to assign a value to a variable before execution of the program, use the -v option:

awk -v AMOUNT=100 -f prg1.awk data1
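
Inside the script, AMOUNT behaves like any other variable. As a quick sketch (the field layout of data1 is assumed here for illustration), this one-liner prints only the records whose third field exceeds the threshold:

awk -v AMOUNT=100 '$3 > AMOUNT { print }' data1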

Behold the power

The power of awk comes from how much it does automatically for you when crunching text files, and from the simple elegance of the language. When you feed awk a text file, it does the following things for you:

  1. Opens and reads all input files listed on the command line
  2. Handles memory management for all variables
  3. Parses each line and splits it into fields using the field separator (the default is whitespace, but it can be changed)
  4. Presents each line of text to your program as variable $0
  5. Presents each field from each line in predefined variables, starting with $1, $2, ... $N
  6. Maintains many internal variables for your use such as (but not limited to):
    • RS = record separator
    • FS = field separator
    • NF = number of fields in the current record
    • NR = number of records processed so far
  7. Automatically handles conversion between internal data types (string, floating point, array)
  8. Executes the BEGIN block before processing any records (a good place to initialize variables)
  9. Executes the END block after processing all records (a good place to calculate report totals)
  10. Closes all input files listed on the command line
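
To see several of these freebies at once, here is a minimal sketch of a one-liner that uses BEGIN, END, NR, and NF (data1 again stands in for any text file):

awk 'BEGIN { print "starting up" }
{ print "record " NR " has " NF " fields" }
END { print "processed " NR " records" }' data1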

The awk language uses only three internal data types: strings, floating point numbers, and arrays. Variables do not have to be defined before they are used. Awk handles converting data from one type to another as necessary. If you add two strings together using the addition operator (+) and they contain numeric values, you get a numeric result. If a string is used in an arithmetic operation but can't be converted to a number, it is converted to zero. Usually, awk does what you want when handling data conversion.
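
A quick sketch of these conversions in action:

awk 'BEGIN {
    print "10" + "5"       # numeric strings add as numbers: prints 15
    print "abc" + 1        # a non-numeric string converts to zero: prints 1
    print 3.14 " is pi"    # a number converts to a string for concatenation
}'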

Awk can open, read, and write more files than those listed on the command line by using the getline function or by redirecting output from within a program. It has access to a set of internal functions that include math, string manipulation, formatted printing (similar to the C language printf), and miscellaneous functions like pseudo-random numbers. You can also create your own functions or function libraries to share among several programs. All of this is packed into an executable usually about 500K in size. Programmers can typically become proficient in awk within a day, and complete references are available in a single book. You don't need a "bookshelf" of dead trees and CDs to master awk.
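
As a small sketch of a user-defined function combined with output redirection from within a program (the trim() helper and the file name cleaned.txt are invented for this example):

# trim leading and trailing blanks from a string
function trim(s) {
    gsub(/^[ \t]+|[ \t]+$/, "", s)
    return s
}

# write each cleaned-up record to a file not named on the command line
{ print trim($0) > "cleaned.txt" }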

Implementations or ports of awk are available on nearly every platform, making your scripts reasonably portable.

AWK in the real world

Here is a short example of an awk application I created to import a list of email addresses and names from Novell Groupwise into PHPList (a mailing list manager). The list was exported from Groupwise in the vCard file format (VCF), a text-based format. Here is an example entry from the VCF file:

BEGIN:VCARD
VERSION:2.1
X-GWTYPE:USER
FN:Bar, Foo
ORG:;GREEN
EMAIL;WORK;PREF:foobar@yahoo.com
N:Bar;Foo
X-GWUSERID:foobar
X-GWADDRFMT:0
X-GWIDOMAIN:yahoo.com
X-GWTARGET:TO
END:VCARD

The target format was a CSV file that PHPList could import into an existing mailing list. I needed to extract the name (from the record that starts with "FN") and the email address (from the record that starts with "EMAIL"). These records are easy to identify, and a small awk script does the job nicely.

I started construction of the script by setting up a custom field separator and a block of code to handle each record type. I saved the script in a text file called extract-emails.awk. Note that the .awk file extension is just convention; the file containing awk commands can be named anything. This was the beginning of the script:

BEGIN { FS = ":" }

/^FN/ {
    # handle name here
}

/^EMAIL/ {
    # handle email address here
}

The BEGIN block is run once, before any records are read. It sets the field separator to a colon, so awk splits each record into fields wherever a colon appears.

The regular expressions /^FN/ and /^EMAIL/ tell awk to look for the characters "FN" or "EMAIL" at the start of a record, and if a match is found, run the associated block of code between the curly braces. This kind of regular expression match is common in awk but not required. A block of code with no match expression is run for every record processed by awk. I added a couple of comments (lines starting with "#") to document what each part of the script does.
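
For example, this pattern-less one-liner runs its block for every record, numbering each line of its input:

awk '{ print NR ": " $0 }' data1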

Looking at the VCF data, I noticed that the "FN" record always precedes the "EMAIL" record, so I ordered the code blocks to process the records that way. Awk executes the rules of a script in the order they appear. Often the order of the code does not matter, but in this case it does: the name belongs with the email address, and I needed to retain that relationship as the file was read. So I saved the name in a variable, then wrote both the email address and name to standard out while processing the email record.

Getting back to the task, let's complete the name section. The goal is to reformat the name from "lastname, firstname" into "firstname lastname", removing the comma. Here was my solution for the name:

/^FN/ {
    # handle name here: turn "Bar, Foo" into "foo bar"
    fullname = tolower($2)
    # names[1] gets "bar"; names[2] gets " foo" (the blank after the comma stays)
    split(fullname, names, ",")
    name = names[2] " " names[1]
}

Because awk has split each incoming record into fields at the colon, the field variables for the example "FN" record contain the following:

$1 = "FN"
$2 = "Bar, Foo"

Working with the $2 variable, I used a built-in awk function, tolower(), to convert the name to lowercase and stored the result in a variable called "fullname". Next, I used the split() function to break the name into first and last name parts, with the result stored in an array called "names". Finally, I glued the name back together in the desired order, with a space in place of the comma, and stored the result in a variable called "name".

There is very little to do inside the email code block. Awk provides the email address to us in the $2 variable (note that $2 in the "EMAIL" record is different from $2 in the "FN" record). For consistency, I converted it to lowercase, then used the print statement to write both the email address and name to standard out, with a comma separating the values. Here is the complete script:

BEGIN { FS = ":" }

/^FN/ {
    # handle name here
    fullname = tolower($2)
    split(fullname, names, ",")
    name = names[2] " " names[1]
}

/^EMAIL/ {
    # handle email address here
    mail = tolower($2)
    print mail "," name
}

A sprinkle of shell glue

To pull it all together, we need a little shell glue. A small shell script lets us call awk with the command line parameters we want and easily redirect the output to a file. A shell script is also handy when testing.

#!/bin/sh
# Extract e-mail addresses from VCF file for PHPList.
awk -f extract-emails.awk groupwise.vcf > phplist-emails.txt

Awk can be used as an intermediate step in a larger shell script where the output is fed into another utility such as sort, grep, or another awk script.
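
For example, a sketch of chaining the extract through sort to drop any duplicate addresses (sort -u is standard POSIX):

awk -f extract-emails.awk groupwise.vcf | sort -u > phplist-emails.txt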

Finally, here is a sample of the output:

foobar@yahoo.com, foo bar
barbaz@yahoo.com, bar baz

Where AWK falls short

There are certain tasks that are beyond the capabilities of awk. For instance, if you need to do anything that communicates using network sockets, awk is not your best bet. Similarly, if you need to process binary files, awk falls short. The latest version of GNU awk does have some rudimentary network capabilities, but Perl, PHP, and Ruby are much better equipped for those tasks.
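
For the curious, here is a rough sketch of that gawk-specific networking support, using its /inet special files and the |& coprocess operator (neither is POSIX, and www.example.com is just a placeholder host):

BEGIN {
    # gawk only: opening this special file creates a TCP connection
    service = "/inet/tcp/0/www.example.com/80"
    printf "GET / HTTP/1.0\r\n\r\n" |& service
    # read the response back from the coprocess, line by line
    while ((service |& getline line) > 0)
        print line
    close(service)
}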

A million household uses

Awk is an expert tool for text processing, and Perl's roots in awk are plain to see. It is powerful enough to handle almost any kind of text crunching or reporting, while being very easy to learn and use. There is a lot of competition and many choices when it comes to scripting languages, but I still find awk the best choice for many problems. Although awk is employed most often for smaller problems, it can be used for large applications. I worked on a 12,000-line awk application used to adjudicate dental claims; it was the core system for a successful million-dollar business. Despite being a common utility on every Linux system, awk remains relatively obscure. If you take the time to learn it, the rewards will last a lifetime.