COMS W4115
Programming Languages and Translators
Lecture 3: September 16, 2009
An AWK Tutorial

Overview

  1. Review
  2. An AWK tutorial
  3. Language processing tools
  4. Overview of compilation
  5. Reading

1. An AWK Tutorial

    This section provides a suggested example for a language tutorial. For the most part, the material consists of excerpts from the first chapter "An Awk Tutorial" of the book The AWK Programming Language by Aho, Kernighan and Weinberger.

    1.1 Getting Started

    Awk is a convenient and expressive programming language that can be applied to a wide variety of computing and data-manipulation tasks.

    Suppose you have a file called emp.data that contains the name, pay rate in dollars per hour, and the number of hours worked for your employees, one employee record per line, like this:
    
       Beth  4.00  0
       Dan   3.75  0
       Kathy 4.00  10
       Mark  5.00  20
       Mary  5.50  22
       Susie 4.25  18
      
    Now you want to print the name and pay (rate times hours) for everyone who worked more than zero hours. Just type this command line:
    
       awk '$3 > 0 { print $1, $2 * $3 }' emp.data
      
    You should get this output:
    
       Kathy 40
       Mark 100
       Mary 121
       Susie 76.5
      

    The Structure of an AWK Program

    In the command line above, the part between the quote characters is the program in the awk programming language. Each awk program is a sequence of one or more pattern-action statements:
       pattern	{ action }
       pattern	{ action }
       . . .
      
    The basic operation of awk is to scan a sequence of input lines one after another, searching for lines that are matched by any of the patterns in the program. Every input line is tested against each of the patterns in turn. For each pattern that matches, the corresponding action is performed. Then the next line is read and the matching starts over. This continues until all the input is read.

    Running an AWK Program

    There are several ways to run an awk program. You can type a command line of the form
       awk 'program' input files
      
    to run the program on each of the specified input files.
    You can omit the input files from the command line and just type
       awk 'program'
      
    In this case awk will apply the program to whatever you type next on your terminal until you type an end-of-file signal.
    This arrangement is convenient when the program is short (a few lines). If the program is long, however, it is more convenient to put it into a separate file, say progfile, and then type the command line
       awk -f progfile optional list of input files
      
    The -f option instructs awk to fetch the program from the named program file. For the rest of this tutorial we will show just the program, not the whole command line.
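
    The ways of running a program described above can be tried side by side. The following is a minimal sketch, assuming a POSIX shell and an awk on the search path; it recreates emp.data from section 1.1 and stores the first example program in a file named progfile, as in the text.

```shell
# Recreate the sample data from section 1.1.
printf '%s\n' 'Beth  4.00  0' 'Dan   3.75  0' 'Kathy 4.00  10' \
              'Mark  5.00  20' 'Mary  5.50  22' 'Susie 4.25  18' > emp.data

# Put the awk program in its own file.
cat > progfile <<'EOF'
$3 > 0 { print $1, $2 * $3 }
EOF

awk '$3 > 0 { print $1, $2 * $3 }' emp.data   # program on the command line
awk -f progfile emp.data                       # program fetched from progfile
awk -f progfile < emp.data                     # no input files: read standard input
```

    All three commands print the same four lines, beginning with Kathy 40.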

    1.2 Simple Output

    There are only two types of data in awk: numbers and strings of characters. The emp.data file is typical of this kind of information -- a mixture of words and numbers separated by blanks and/or tabs.
    Awk reads its input one line at a time and splits each line into fields, where, by default, a field is a sequence of characters that doesn't contain any blanks or tabs. The first field in the current input line is called $1, the second $2, and so forth. The entire line is called $0. The number of fields can vary from line to line.

    Often, all we need to do is print some or all of the fields of each line, perhaps performing some calculations. The programs in this section are all of that form.

    Printing Every Line

    If an action has no pattern, the action is performed for all input lines. The statement print by itself prints the current input line, so the program
       { print }
      
    prints all of its input on the standard output. Since $0 is the whole line,
       { print $0 }
      
    does the same thing.

    Printing Certain Fields

    More than one item can be printed on the same output line with a single print statement. The program to print the first and third fields of each input line is
       { print $1, $3 }
      
    Expressions separated by a comma in a print statement are, by default, separated by a single blank when they are printed. Each line produced by print ends with a newline character. Both of these defaults can be changed.
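
    One way to change the two print defaults is through the built-in variables OFS (output field separator) and ORS (output record separator). These variables are standard awk but are not introduced in this tutorial, so treat the following as a sketch, assuming a POSIX shell and awk.

```shell
# Recreate the sample data from section 1.1.
printf '%s\n' 'Beth  4.00  0' 'Dan   3.75  0' 'Kathy 4.00  10' \
              'Mark  5.00  20' 'Mary  5.50  22' 'Susie 4.25  18' > emp.data

# OFS replaces the blank printed between comma-separated items;
# ORS replaces the newline printed at the end of each print.
awk 'BEGIN { OFS = ":"; ORS = "\n\n" } { print $1, $3 }' emp.data
```

    The first output line is Beth:0, and every output line is followed by an empty line because ORS is now two newlines.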

    NF, the Number of Fields

    It might appear you must always refer to fields as $1, $2, and so on, but any expression can be used after $ to denote a field number; the expression is evaluated and its numeric value is used as the field number. Awk counts the number of fields in the current line and stores the count in a built-in variable called NF. Thus, the program
       { print NF, $1, $NF }
      
    prints the number of fields and the first and last fields of each input line.
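
    Because any expression may follow $, a field can also be addressed relative to NF. The sketch below uses a made-up three-line input (not part of emp.data) whose lines have different numbers of fields, and assumes a POSIX shell and awk.

```shell
# $(NF-1) is the next-to-last field. The parentheses matter:
# $NF-1 would mean "the value of the last field, minus 1".
printf '%s\n' 'a b c' 'd e' 'f g h i' |
    awk '{ print NF, $(NF-1), $NF }'
# prints:
#   3 b c
#   2 d e
#   4 h i
```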

    Computing and Printing

    You can also do computations on the field values and include the results in what is printed. The program
       { print $1, $2 * $3 }
      
    prints the name and total pay (rate times hours) for each employee.

    Printing Line Numbers

    Awk provides another built-in variable, called NR, that counts the number of lines read so far. We can use NR and $0 to prefix each line of emp.data with its line number:
       { print NR, $0 }
      
    The output looks like this:
    
       1 Beth  4.00  0
       2 Dan   3.75  0
       3 Kathy 4.00  10
       4 Mark  5.00  20
       5 Mary  5.50  22
       6 Susie 4.25  18
      

    Putting Text in the Output

    You can also print words in the midst of fields and computed values:
       { print "total pay for", $1, "is", $2 * $3 }
      
    prints
    
       total pay for Beth is 0
       total pay for Dan is 0
       total pay for Kathy is 40
       total pay for Mark is 100
       total pay for Mary is 121
       total pay for Susie is 76.5
      
    In the print statement, the text inside the double quotes is printed along with the fields and computed values.

    1.3 Fancier Output

    The print statement is meant for quick and easy output. To format the output exactly the way you want it, you may have to use the printf statement. As we see in the AWK language reference manual, printf can produce almost any kind of output, but in this section we'll only show a few of its capabilities.

    Lining Up Fields

    The printf statement has the form
       printf(format, value1,...,valuen)
      
    where format is a string that contains text to be printed verbatim, interspersed with specifications of how each of the values is to be printed. A specification is a % followed by a few characters that control the format of a value. The first specification tells how value1 is to be printed, the second how value2 is to be printed, and so on. Thus, there must be as many % specifications in format as values to be printed.
    Here's a program that uses printf to print the total pay for every employee:
       { printf("total pay for %s is $%.2f\n", $1, $2 * $3) }
      
    The specification string in the printf statement contains two % specifications. The first, %s, says to print the first value, $1, as a string of characters; the second, %.2f, says to print the second value, $2*$3, as a number with 2 digits after the decimal point. Everything else in the specification string, including the dollar sign, is printed verbatim; the \n at the end of the string stands for a newline, which causes subsequent output to begin on the next line. With emp.data as input, this program yields:
    
       total pay for Beth is $0.00
       total pay for Dan is $0.00
       total pay for Kathy is $40.00
       total pay for Mark is $100.00
       total pay for Mary is $121.00
       total pay for Susie is $76.50
      
    With printf, no blanks or newlines are produced automatically; you must create them yourself. Don't forget the \n.
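
    A specification can also carry a field width, which is what actually lines the output up in columns. The sketch below, assuming a POSIX shell and awk, prints each name left-justified in a field 8 characters wide (%-8s) and the pay right-justified in a field 6 characters wide (%6.2f).

```shell
# Recreate the sample data from section 1.1.
printf '%s\n' 'Beth  4.00  0' 'Dan   3.75  0' 'Kathy 4.00  10' \
              'Mark  5.00  20' 'Mary  5.50  22' 'Susie 4.25  18' > emp.data

awk '{ printf("%-8s $%6.2f\n", $1, $2 * $3) }' emp.data
# prints:
#   Beth     $  0.00
#   Dan      $  0.00
#   Kathy    $ 40.00
#   Mark     $100.00
#   Mary     $121.00
#   Susie    $ 76.50
```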

    Sorting the Output

    Suppose you want to print all the data for each employee, along with his or her pay, sorted in order of increasing pay. The easiest way is to use awk to prefix the total pay to each employee record, and run that output through a sorting program. On Unix, the command line
    
       awk '{ printf("%6.2f  %s\n", $2 * $3, $0) }' emp.data | sort
      
    pipes the output of awk into the sort command, and produces:
    
         0.00  Beth  4.00  0
         0.00  Dan   3.75  0
        40.00  Kathy 4.00  10
        76.50  Susie 4.25  18
       100.00  Mark  5.00  20
       121.00  Mary  5.50  22
      

    1.4 Selection

    Awk patterns are good for selecting interesting lines from the input for further processing. Since a pattern without an action prints all lines matching the pattern, many awk programs consist of nothing more than a single pattern. This section gives some examples of useful patterns.

    Selection by Comparison

    This program uses a comparison pattern to select the records of employees who earn $5.00 or more per hour, that is, lines in which the second field is greater than or equal to 5:
       $2 >= 5
      
    It selects these lines from emp.data:
    
       Mark  5.00  20
       Mary  5.50  22
      

    Selection by Computation

    The program
    
       $2 * $3 > 50 { printf("$%.2f for %s\n", $2 * $3, $1) }
      
    prints the pay of those employees whose total pay exceeds $50:
    
       $100.00 for Mark
       $121.00 for Mary
       $76.50 for Susie
      

    Selection by Text Content

    Besides numeric tests, you can select input lines that contain specific words or phrases. This program prints all lines in which the first field is Susie:
       $1 == "Susie"
      
    The operator == tests for equality. You can also look for text containing any of a set of letters, words, and phrases by using patterns called regular expressions. This program prints all lines that contain Susie anywhere:
       /Susie/
      
    The output is this line:
    
       Susie  4.25  18
      
    Regular expressions can be used to specify much more elaborate patterns.
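
    A few slightly more elaborate patterns are sketched below, assuming a POSIX shell and awk. The match operator ~ is standard awk but is not introduced in this tutorial; it tests a single field, rather than the whole line, against a regular expression.

```shell
# Recreate the sample data from section 1.1.
printf '%s\n' 'Beth  4.00  0' 'Dan   3.75  0' 'Kathy 4.00  10' \
              'Mark  5.00  20' 'Mary  5.50  22' 'Susie 4.25  18' > emp.data

awk '$1 ~ /^M/' emp.data        # first field begins with M: Mark, Mary
awk '$1 ~ /y$/' emp.data        # first field ends with y: Kathy, Mary
awk '/^(Beth|Dan) /' emp.data   # line begins with Beth or Dan, then a blank
```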

    Combinations of Patterns

    Patterns can be combined with parentheses and the logical operators &&, ||, and !, which stand for AND, OR, and NOT. The program
       $2 >= 4 || $3 >= 20
      
    prints those lines where $2 is at least 4 or $3 is at least 20:
    
       Beth  4.00  0
       Kathy 4.00  10
       Mark  5.00  20
       Mary  5.50  22
       Susie 4.25  18
      
    Lines that satisfy both conditions are printed only once.
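
    The other two operators can be tried the same way. In this sketch, assuming a POSIX shell and awk, the second command rewrites the || pattern using ! and && and selects the same five lines, while the third requires both conditions to hold.

```shell
# Recreate the sample data from section 1.1.
printf '%s\n' 'Beth  4.00  0' 'Dan   3.75  0' 'Kathy 4.00  10' \
              'Mark  5.00  20' 'Mary  5.50  22' 'Susie 4.25  18' > emp.data

awk '$2 >= 4 || $3 >= 20' emp.data      # either condition holds
awk '!($2 < 4 && $3 < 20)' emp.data     # same selection, written with NOT and AND
awk '$2 >= 4 && $3 >= 20' emp.data      # both conditions hold: Mark and Mary
```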

    BEGIN and END

    The special pattern BEGIN matches before the first line of the first input file is read, and END matches after the last line of the last file has been processed. This program uses BEGIN to print a heading:
       BEGIN  { print "NAME  RATE  HOURS"; print "" }
              { print }
      
    The output is:
    
       NAME  RATE  HOURS
     
       Beth  4.00  0
       Dan   3.75  0
       Kathy 4.00  10
       Mark  5.00  20
       Mary  5.50  22
       Susie 4.25  18
      
    You can put several statements on a single line if you separate them by semicolons. Note that print "" prints a blank line, quite different from just plain print, which prints the current line.

    1.5 Computing with AWK

    An action is a sequence of statements separated by newlines or semicolons. This section provides examples of statements for performing simple numeric and string computations. In these statements you can use not only built-in variables like NF, but also variables that you create yourself for performing calculations, storing data, and the like. In awk, user-created variables are not declared.

    Counting

    This program uses a variable emp to count employees who have worked more than 15 hours:
       $3 > 15  { emp = emp + 1 }
       END      { print emp, "employees worked more than 15 hours" }
      
    For every line in which the third field exceeds 15, the previous value of emp is incremented by 1. With emp.data as input, this program yields:
    
       3 employees worked more than 15 hours
      
    Awk variables used as numbers begin life with the value 0, so we didn't need to initialize emp.

    To count the number of employees, we can use the built-in variable NR, which holds the number of lines read so far; its value at the end of all input is the total number of lines read.
    
       END      { print NR, "employees" }
      
    The output is:
    
       6 employees
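
    Counting with NR combines naturally with a running sum. The following sketch, assuming a POSIX shell and awk, accumulates total pay in a variable named pay (a name chosen here for illustration) and uses NR in the END action to compute the average. It assumes the input is not empty, since dividing by an NR of zero is an error.

```shell
# Recreate the sample data from section 1.1.
printf '%s\n' 'Beth  4.00  0' 'Dan   3.75  0' 'Kathy 4.00  10' \
              'Mark  5.00  20' 'Mary  5.50  22' 'Susie 4.25  18' > emp.data

awk '    { pay = pay + $2 * $3 }
     END { print NR, "employees"
           print "total pay is", pay
           print "average pay is", pay / NR }' emp.data
# prints:
#   6 employees
#   total pay is 337.5
#   average pay is 56.25
```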
      

    String Concatenation

    New strings may be created by combining old ones; this operation is called concatenation. The program
    
               { names = names $1 " " }
       END     { print names }
      
    collects all the employee names into a single string, by appending each name and a blank to the previous value in the variable names. The value of names is printed by the END action:
    
       Beth Dan Kathy Mark Mary Susie
      
    The concatenation operation is represented in an awk program by writing string values one after the other. At every input line, the first statement in the program concatenates three strings: the previous value of names, the first field, and a blank; it then assigns the resulting string to names. Variables used to store strings begin life holding the null string, so in this program names did not have to be explicitly initialized.

    Built-in Functions

    We have already seen that awk provides built-in variables that maintain frequently used quantities like the number of fields and the input line number. Similarly, there are built-in functions for computing other useful values. Besides arithmetic functions for square roots, logarithms, random numbers, and the like, there are also functions that manipulate text. One of these is length, which counts the number of characters in a string. For example, this program computes the length of each person's name:
    
       { print $1, length($1) }
      
    The result:
    
       Beth 4
       Dan 3
       Kathy 5
       Mark 4
       Mary 4
       Susie 5
      

    Counting Lines, Words, and Characters

    This program uses length, NF, and NR to count the number of lines, words, and characters in the input. For convenience, we'll treat each field as a word.
    
               { nc = nc + length($0) + 1
                 nw = nw + NF }
       END     { print NR, "lines,", nw, "words,", nc, "characters" }
      
    The file emp.data has
    
       6 lines, 18 words, 77 characters
      
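    We have added one for the newline character at the end of each input line, since $0 doesn't include it. The count of 77 characters assumes that the fields in emp.data are separated by a single blank or tab; a file padded with extra blanks to align its columns yields a larger count.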

    1.6 Control-Flow Statements

    Awk provides statements for making decisions and writing loops, mostly modeled on those found in the C programming language. They can only be used in actions.

    If-Else Statement

    The following program computes the total and average pay of employees making more than $6.00 an hour. It uses an if to defend against division by zero in computing the average pay.
    
       $2 > 6  { n = n + 1; pay = pay + $2 * $3 }
       END     { if (n > 0)
                    print n, "employees, total pay is", pay,
                    "average pay is", pay/n
                 else
                     print "no employees are paid more than $6/hour"
               }
      
    In the if-else statement, the condition following the if is evaluated. If it is true, the first print statement is performed. Otherwise, the second print statement is performed. Note that we can continue a long statement over several lines by breaking it after a comma.
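
    The loop statements follow the same C-like shape. As a sketch, assuming a POSIX shell and awk, the two programs below print the fields of each line in reverse order, first with a for loop and then with an equivalent while loop. The variable names line and i are chosen here for illustration.

```shell
# Recreate the sample data from section 1.1.
printf '%s\n' 'Beth  4.00  0' 'Dan   3.75  0' 'Kathy 4.00  10' \
              'Mark  5.00  20' 'Mary  5.50  22' 'Susie 4.25  18' > emp.data

# for loop: start from the last field and append the earlier ones.
awk '{ line = $NF
       for (i = NF - 1; i >= 1; i--)
           line = line " " $i
       print line }' emp.data

# Equivalent while loop.
awk '{ line = $NF; i = NF - 1
       while (i >= 1) { line = line " " $i; i-- }
       print line }' emp.data
```

    Both programs print 0 4.00 Beth for the first input line.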

    1.7 Arrays

    Awk provides arrays for storing groups of related values. Unlike arrays in many earlier languages, awk arrays are subscripted by strings of characters. This gives awk a capability like the associative memory of SNOBOL4 tables, and for this reason, arrays in awk are called associative arrays.

    Suppose some employees get a raise and work at two different pay rates:
    
       Beth  4.00  0
       Dan   3.75  0
       Kathy 4.00  10
       Mark  5.00  20
       Mary  5.50  22
       Susie 4.25  18
       Kathy 5.25  25
       Susie 5.75  12
      
    You now want to compute and print, for each employee who worked, the employee's total number of hours and total pay. The following awk program does the job:
    
       $3 > 0  { hours[$1] += $3; pay[$1] += $2 * $3 }
       END     { for (emp in hours)
                    printf("%s worked %d hours and received $%.2f\n",\
                           emp, hours[emp], pay[emp])
               }
      
    The first action uses two arrays, hours and pay, to accumulate for each employee his or her total hours and pay. The name in the first field ($1) is used to index each array. As in C, an assignment statement of the form x += y is a shorthand for x = x + y. When an array entry is first created, it is automatically initialized to zero if it is used to hold numeric data.
    The END action uses a form of the for statement that loops over all subscripts that were used to index the array. The loop executes the printf statement with the variable emp set in turn to each different subscript in the array. The order in which the subscripts are considered is implementation dependent. Note that the long printf statement has been broken by a backslash at the end of the line. One possible output from executing this program on the new employee data is
    
       Mary worked 22 hours and received $121.00
       Kathy worked 35 hours and received $171.25
       Susie worked 30 hours and received $145.50
       Mark worked 20 hours and received $100.00
      
    We could pipe the output into sort as we did above to sort the output by employee name.
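
    A sketch of that pipeline, assuming a POSIX shell, awk, and sort. The file name emp2.data is made up here to hold the extended employee data. Because each output line begins with the employee's name, a plain sort orders the report by name regardless of the order in which awk visits the subscripts.

```shell
# The extended data with two pay rates for Kathy and Susie.
printf '%s\n' 'Beth  4.00  0' 'Dan   3.75  0' 'Kathy 4.00  10' \
              'Mark  5.00  20' 'Mary  5.50  22' 'Susie 4.25  18' \
              'Kathy 5.25  25' 'Susie 5.75  12' > emp2.data

awk '$3 > 0  { hours[$1] += $3; pay[$1] += $2 * $3 }
     END     { for (emp in hours)
                   printf("%s worked %d hours and received $%.2f\n",
                          emp, hours[emp], pay[emp])
             }' emp2.data | sort
# prints:
#   Kathy worked 35 hours and received $171.25
#   Mark worked 20 hours and received $100.00
#   Mary worked 22 hours and received $121.00
#   Susie worked 30 hours and received $145.50
```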

    1.8 A Handful of Useful "One-liners"

    Although awk can be used to write programs of some complexity, many useful programs are not much more complicated than what we've seen so far. Here is a collection of short programs that you might find handy and/or instructive. Most are variations on material already covered.
    1. Print the total number of input lines:
            END { print NR }
        
    2. Print the tenth input line:
            NR == 10
        
    3. Print the last field of every input line:
            { print $NF }
        
    4. Print the last field of the last input line:
                { field = $NF }
            END { print field }
        
    5. Print every unique input line:
            !a[$0]++
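
    The last one-liner is compact enough to deserve unpacking. The array entry a[$0] counts how many times the current line has been seen; the postfix ++ yields the old count and then increments it, so !a[$0]++ is true only the first time a given line appears. The sketch below, assuming a POSIX shell and awk, runs it on a made-up five-line input next to a longer program with the same effect (the array name seen is chosen for illustration).

```shell
# The one-liner: print a line only the first time it appears.
printf '%s\n' a b a c b | awk '!a[$0]++'

# The same logic spelled out with an explicit test and increment.
printf '%s\n' a b a c b |
    awk '{ if (seen[$0] == 0) print
           seen[$0] = seen[$0] + 1 }'
# each command prints:
#   a
#   b
#   c
```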
        

    1.9 Summary

    You have now seen the essentials of awk. Each program in this chapter has been a sequence of pattern-action statements. Awk tests every input line against the patterns, and when a pattern matches, performs the corresponding action. Patterns can involve numeric and string comparisons, and actions can include computation and formatted printing. Besides reading through your input files automatically, awk splits each input line into fields. It also provides a number of built-in variables and functions, and lets you define your own as well. With this combination of features, quite a few useful computations can be expressed by short programs -- many of the details that would be needed in another language are handled implicitly in an awk program.

    For more information, consult the man pages for awk and "The AWK Programming Language" book by Aho, Kernighan, and Weinberger.

2. Language Processing Tools


3. Structure of a Compiler


4. Reading



aho@cs.columbia.edu