COMS W4115
Programming Languages and Translators
Lecture 3: September 16, 2009
An AWK Tutorial

Overview

Review
An AWK tutorial
Language processing tools
Overview of compilation
Reading

1. An AWK Tutorial

This section provides a suggested example for a language tutorial. For the most part, the material consists of excerpts from the first chapter "An Awk Tutorial" of the book The AWK Programming Language by Aho, Kernighan and Weinberger.

1.1 Getting Started

Awk is a convenient and expressive programming language that can be applied to a wide variety of computing and data-manipulation tasks.

Suppose you have a file called emp.data that contains the name, pay rate in dollars per hour, and the number of hours worked for your employees, one employee record per line, like this:


   Beth  4.00  0
   Dan   3.75  0
   Kathy 4.00  10
   Mark  5.00  20
   Mary  5.50  22
   Susie 4.25  18

Now you want to print the name and pay (rate times hours) for everyone who worked more than zero hours. Just type this command line:


   awk '$3 > 0 { print $1, $2 * $3 }' emp.data

You should get this output:


   Kathy 40
   Mark 100
   Mary 121
   Susie 76.5

The Structure of an AWK Program

In the command line above, the part between the quote characters is the program in the awk programming language. Each awk program is a sequence of one or more pattern-action statements:

   pattern	{ action }
   pattern	{ action }
   . . .

The basic operation of awk is to scan a sequence of input lines one after another, searching for lines that are matched by any of the patterns in the program. Every input line is tested against each of the patterns in turn. For each pattern that matches, the corresponding action is performed. Then the next line is read and the matching starts over. This continues until all the input is read.
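
As a small illustration of this matching loop, the sketch below runs a program with two pattern-action statements over the emp.data file from section 1.1. The two patterns and the label strings are made up for this example, and the shell lines around the awk program only recreate the file in a scratch directory. A line that matches both patterns triggers both actions, in program order, before the next line is read.

```shell
# Recreate emp.data from section 1.1 in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/emp.data" <<'EOF'
Beth  4.00  0
Dan   3.75  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18
EOF

# Two pattern-action statements; every input line is tested against both.
out=$(awk '
$2 >= 5  { print "high rate:", $1 }     # illustrative pattern 1
$3 >= 20 { print "long hours:", $1 }    # illustrative pattern 2
' "$dir/emp.data")
echo "$out"
# high rate: Mark
# long hours: Mark
# high rate: Mary
# long hours: Mary
```

Both lines for Mark appear before either line for Mary, because all patterns are tried on one input line before awk moves to the next.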

Running an AWK Program

There are several ways to run an awk program. You can type a command line of the form

   awk 'program' input files

to run the program on each of the specified input files.

You can omit the input files from the command line and just type

   awk 'program'

In this case awk will apply the program to whatever you type next on your terminal until you type an end-of-file signal.

This arrangement is convenient when the program is short (a few lines). If the program is long, however, it is more convenient to put it into a separate file, say progfile, and then type the command line

   awk -f progfile optional list of input files

The -f option instructs awk to fetch the program from the named program file. For the rest of this tutorial we will show just the program, not the whole command line.
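
The ways of running a program described above can be tried with the following sketch. It stores the program from section 1.1 in a file (named progfile as in the text, but placed in a scratch directory, which is an assumption of this example) and runs it once with a named input file and once with the data arriving on standard input.

```shell
# Recreate emp.data from section 1.1 in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/emp.data" <<'EOF'
Beth  4.00  0
Dan   3.75  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18
EOF

# Put the awk program into its own file.
cat > "$dir/progfile" <<'EOF'
$3 > 0 { print $1, $2 * $3 }
EOF

# Form 1: program taken from a file with -f, input file named on the command line.
from_file=$(awk -f "$dir/progfile" "$dir/emp.data")

# Form 2: no input files named, so awk reads its standard input
# (redirected from the file here instead of typed at the terminal).
from_stdin=$(awk -f "$dir/progfile" < "$dir/emp.data")

echo "$from_file"
# Kathy 40
# Mark 100
# Mary 121
# Susie 76.5
```

Both forms produce the same output; only the source of the input differs.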

1.2 Simple Output

There are only two types of data in awk: numbers and strings of characters. The emp.data file is typical of this kind of information -- a mixture of words and numbers separated by blanks and/or tabs.

Awk reads its input one line at a time and splits each line into fields, where, by default, a field is a sequence of characters that doesn't contain any blanks or tabs. The first field in the current input line is called $1, the second $2, and so forth. The entire line is called $0. The number of fields can vary from line to line.

Often, all we need to do is print some or all of the fields of each line, perhaps performing some calculations. The programs in this section are all of that form.

Printing Every Line

If an action has no pattern, the action is performed for all input lines. The statement print by itself prints the current input line, so the program

   { print }

prints all of its input on the standard output. Since $0 is the whole line,

   { print $0 }

does the same thing.

Printing Certain Fields

More than one item can be printed on the same output line with a single print statement. The program to print the first and third fields of each input line is

   { print $1, $3 }

Expressions separated by a comma in a print statement are, by default, separated by a single blank when they are printed. Each line produced by print ends with a newline character. Both of these defaults can be changed.
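
One way to change these defaults is sketched below. It relies on the standard awk built-in variables OFS (output field separator) and ORS (output record separator), which the text above does not name; the separator values chosen here are arbitrary.

```shell
line='Beth  4.00  0'

# Default: expressions separated by commas are printed with one blank between them.
default_out=$(echo "$line" | awk '{ print $1, $3 }')

# OFS replaces the blank between items; ORS replaces the newline that ends each print.
custom_out=$(echo "$line" | awk 'BEGIN { OFS = ":"; ORS = ";\n" } { print $1, $3 }')

echo "$default_out"   # Beth 0
echo "$custom_out"    # Beth:0;
```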

NF, the Number of Fields

It might appear you must always refer to fields as $1, $2, and so on, but any expression can be used after $ to denote a field number; the expression is evaluated and its numeric value is used as the field number. Awk counts the number of fields in the current line and stores the count in a built-in variable called NF. Thus, the program

   { print NF, $1, $NF }

prints the number of fields and the first and last fields of each input line.
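
A short sketch of the point that any expression may follow $: the input lines below are invented and have different numbers of fields, and $(NF-1) is added to the program above to select the next-to-last field.

```shell
# Two input lines with 3 and 2 fields respectively.
out=$(printf 'a b c\nd e\n' | awk '{ print NF, $1, $NF, $(NF-1) }')
echo "$out"
# 3 a c b
# 2 d e d
```

Note the difference between NF, which is the count of fields, and $NF, which is the contents of the last field.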

Computing and Printing

You can also do computations on the field values and include the results in what is printed. The program

   { print $1, $2 * $3 }

prints the name and total pay (rate times hours) for each employee.

Printing Line Numbers

Awk provides another built-in variable, called NR, that counts the number of lines read so far. We can use NR and $0 to prefix each line of emp.data with its line number:

   { print NR, $0 }

The output looks like this:


   1 Beth  4.00  0
   2 Dan   3.75  0
   3 Kathy 4.00  10
   4 Mark  5.00  20
   5 Mary  5.50  22
   6 Susie 4.25  18

Putting Text in the Output

You can also print words in the midst of fields and computed values:

   { print "total pay for", $1, "is", $2 * $3 }

prints


   total pay for Beth is 0
   total pay for Dan is 0
   total pay for Kathy is 40
   total pay for Mark is 100
   total pay for Mary is 121
   total pay for Susie is 76.5

In the print statement, the text inside the double quotes is printed along with the fields and computed values.

1.3 Fancier Output

The print statement is meant for quick and easy output. To format the output exactly the way you want it, you may have to use the printf statement. As we see in the AWK language reference manual, printf can produce almost any kind of output, but in this section we'll only show a few of its capabilities.

Lining Up Fields

The printf statement has the form

   printf(format, value₁, value₂, ..., valueₙ)

where format is a string that contains text to be printed verbatim, interspersed with specifications of how each of the values is to be printed. A specification is a % followed by a few characters that control the format of a value. The first specification tells how value₁ is to be printed, the second how value₂ is to be printed, and so on. Thus, there must be as many % specifications in format as values to be printed.

Here's a program that uses printf to print the total pay for every employee:

   { printf("total pay for %s is $%.2f\n", $1, $2 * $3) }

The specification string in the printf statement contains two % specifications. The first, %s, says to print the first value, $1, as a string of characters, the second, %.2f, says to print the second value, $2*$3, as a number with 2 digits after the decimal point. Everything else in the specification string, including the dollar sign, is printed verbatim; the \n at the end of the string stands for a newline, which causes subsequent output to begin on the next line. With emp.data as input, this program yields:


   total pay for Beth is $0.00
   total pay for Dan is $0.00
   total pay for Kathy is $40.00
   total pay for Mark is $100.00
   total pay for Mary is $121.00
   total pay for Susie is $76.50

With printf, no blanks or newlines are produced automatically; you must create them yourself. Don't forget the \n.
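
To actually line up the fields, a width can be added to each specification. The sketch below is a variation on the program above, not part of the original notes: %-8s prints the name left-justified in a field 8 characters wide, and %6.2f prints the pay right-justified in 6 characters with 2 digits after the decimal point.

```shell
# Recreate emp.data from section 1.1 in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/emp.data" <<'EOF'
Beth  4.00  0
Dan   3.75  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18
EOF

# %-8s  : string, left-justified, padded to 8 characters
# %6.2f : number, 2 decimals, right-justified in 6 characters
out=$(awk '{ printf("%-8s $%6.2f\n", $1, $2 * $3) }' "$dir/emp.data")
echo "$out"
# Beth     $  0.00
# Dan      $  0.00
# Kathy    $ 40.00
# Mark     $100.00
# Mary     $121.00
# Susie    $ 76.50
```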

Sorting the Output

Suppose you want to print all the data for each employee, along with his or her pay, sorted in order of increasing pay. The easiest way is to use awk to prefix the total pay to each employee record, and run that output through a sorting program. On Unix, the command line


   awk '{ printf("%6.2f  %s\n", $2 * $3, $0) }' emp.data | sort

pipes the output of awk into the sort command, and produces:


     0.00  Beth  4.00  0
     0.00  Dan   3.75  0
    40.00  Kathy 4.00  10
    76.50  Susie 4.25  18
   100.00  Mark  5.00  20
   121.00  Mary  5.50  22
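
The fixed-width %6.2f prefix matters because sort compares lines as text by default. The following sketch, which is an addition to the notes, shows what happens when the pay is printed without padding, and how the standard -n option of sort restores numeric order. LC_ALL=C is set only to make the text comparison predictable.

```shell
# Recreate emp.data from section 1.1 in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/emp.data" <<'EOF'
Beth  4.00  0
Dan   3.75  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18
EOF

# Unpadded pay, sorted as text: "100" and "121" come before "40".
as_text=$(awk '{ print $2 * $3, $1 }' "$dir/emp.data" | LC_ALL=C sort)

# Same data, sorted numerically on the leading number.
as_number=$(awk '{ print $2 * $3, $1 }' "$dir/emp.data" | LC_ALL=C sort -n)

echo "$as_number"
# 0 Beth
# 0 Dan
# 40 Kathy
# 76.5 Susie
# 100 Mark
# 121 Mary
```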

1.4 Selection

Awk patterns are good for selecting interesting lines from the input for further processing. Since a pattern without an action prints all lines matching the pattern, many awk programs consist of nothing more than a single pattern. This section gives some examples of useful patterns.

Selection by Comparison

This program uses a comparison pattern to select the records of employees who earn $5.00 or more per hour, that is, lines in which the second field is greater than or equal to 5:

   $2 >= 5

It selects these lines from emp.data:


   Mark  5.00  20
   Mary  5.50  22

Selection by Computation

The program


   $2 * $3 > 50 { printf("$%.2f for %s\n", $2 * $3, $1) }


prints the pay of those employees whose total pay exceeds $50:

   $100.00 for Mark
   $121.00 for Mary
   $76.50 for Susie

Selection by Text Content

Besides numeric tests, you can select input lines that contain specific words or phrases. This program prints all lines in which the first field is Susie:

   $1 == "Susie"

The operator == tests for equality. You can also look for text containing any of a set of letters, words, and phrases by using patterns called regular expressions. This program prints all lines that contain Susie anywhere:

   /Susie/

The output is this line:

   Susie 4.25  18

Regular expressions can be used to specify much more elaborate patterns.
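
As a hedged sketch of slightly more elaborate patterns: the standard awk match operator ~ (not introduced in the text above) tests a single field against a regular expression, ^ anchors the match at the start of the field, and $ anchors it at the end. The two patterns are invented for this example.

```shell
# Recreate emp.data from section 1.1 in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/emp.data" <<'EOF'
Beth  4.00  0
Dan   3.75  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18
EOF

# Names that begin with "Ma".
starts_ma=$(awk '$1 ~ /^Ma/ { print $1 }' "$dir/emp.data")

# Employees whose hourly rate is a whole number of dollars (second field ends in ".00").
whole_rate=$(awk '$2 ~ /\.00$/ { print $1 }' "$dir/emp.data")

echo "$starts_ma"    # Mark, Mary (one per line)
echo "$whole_rate"   # Beth, Kathy, Mark (one per line)
```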

Combinations of Patterns

Patterns can be combined with parentheses and the logical operators &&, ||, and !, which stand for AND, OR, and NOT. The program

   $2 >= 4 || $3 >= 20

prints those lines where $2 is at least 4 or $3 is at least 20:

   Beth  4.00  0
   Kathy 4.00  10
   Mark  5.00  20
   Mary  5.50  22
   Susie 4.25  18

Lines that satisfy both conditions are printed only once.
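
The other two operators can be tried on the same conditions. In this sketch (an addition to the notes) && keeps only the lines that satisfy both tests, and ! applied to the parenthesized || pattern keeps only the lines that satisfy neither.

```shell
# Recreate emp.data from section 1.1 in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/emp.data" <<'EOF'
Beth  4.00  0
Dan   3.75  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18
EOF

# AND: rate at least 4 and hours at least 20.
both=$(awk '$2 >= 4 && $3 >= 20 { print $1 }' "$dir/emp.data")

# NOT of the OR: neither condition holds.
neither=$(awk '!($2 >= 4 || $3 >= 20) { print $1 }' "$dir/emp.data")

echo "$both"      # Mark, Mary (one per line)
echo "$neither"   # Dan
```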

BEGIN and END

The special pattern BEGIN matches before the first line of the first input file is read, and END matches after the last line of the last file has been processed. This program uses BEGIN to print a heading:

   BEGIN  { print "NAME  RATE  HOURS"; print "" }
          { print }

The output is:

   NAME  RATE  HOURS

   Beth  4.00  0
   Dan   3.75  0
   Kathy 4.00  10
   Mark  5.00  20
   Mary  5.50  22
   Susie 4.25  18

You can put several statements on a single line if you separate them by semicolons. Note that print "" prints a blank line, quite different from just plain print, which prints the current line.


1.5 Computing with AWK

An action is a sequence of statements separated by newlines or semicolons. This section provides examples of statements for performing simple numeric and string computations. In these statements you can use not only built-in variables like NF but also variables of your own for performing calculations, storing data, and the like. In awk, user-created variables are not declared.

Counting

This program uses a variable emp to count employees who have worked more than 15 hours:

   $3 > 15  { emp = emp + 1 }
   END      { print emp, "employees worked more than 15 hours" }

For every line in which the third field exceeds 15, the previous value of emp is incremented by 1. With emp.data as input, this program yields:

   3 employees worked more than 15 hours

Awk variables used as numbers begin life with the value 0, so we didn't need to initialize emp.



To count the number of employees, we can use the built-in variable NR, which holds the number of lines read so far; its value at the end of all input is the total number of lines read.

   END      { print NR, "employees" }

The output is:

   6 employees
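
Counting with NR combines naturally with a running sum. The sketch below accumulates the total pay in a user-created variable (the name pay and the output labels are choices made for this example) and divides by NR in the END action to get the average.

```shell
# Recreate emp.data from section 1.1 in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/emp.data" <<'EOF'
Beth  4.00  0
Dan   3.75  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18
EOF

out=$(awk '
    { pay = pay + $2 * $3 }              # pay starts at 0, like any numeric variable
END { print NR, "employees"
      print "total pay is", pay
      print "average pay is", pay / NR }
' "$dir/emp.data")
echo "$out"
# 6 employees
# total pay is 337.5
# average pay is 56.25
```

With an empty input file NR would be 0 and the division would fail, so a more careful version would test NR first.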
  

String Concatenation

New strings may be created by combining old ones; this operation is called concatenation. The program

           { names = names $1 " " }
   END     { print names }

collects all the employee names into a single string, by appending each name and a blank to the previous value in the variable names. The value of names is printed by the END action:

   Beth Dan Kathy Mark Mary Susie

The concatenation operation is represented in an awk program by writing string values one after the other. At every input line, the first statement in the program concatenates three strings: the previous value of names, the first field, and a blank; it then assigns the resulting string to names. Variables used to store strings begin life holding the null string, so in this program names did not have to be explicitly initialized.
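
A common variation, sketched here as an addition to the notes, joins the names with a comma and a blank but leaves no separator after the last name. It uses the fact that an uninitialized variable holds the null string: the hypothetical variable sep is empty when the first name is appended and is set only afterwards.

```shell
# Recreate emp.data from section 1.1 in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/emp.data" <<'EOF'
Beth  4.00  0
Dan   3.75  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18
EOF

out=$(awk '
    { names = names sep $1    # sep is the null string for the first name
      sep = ", " }            # every later name is preceded by ", "
END { print names }
' "$dir/emp.data")
echo "$out"
# Beth, Dan, Kathy, Mark, Mary, Susie
```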

Built-in Functions

We have already seen that awk provides built-in variables that maintain frequently used quantities like the number of fields and the input line number. Similarly, there are built-in functions for computing other useful values. Besides arithmetic functions for square roots, logarithms, random numbers, and the like, there are also functions that manipulate text. One of these is length, which counts the number of characters in a string. For example, this program computes the length of each person's name:

   { print $1, length($1) }

The result:

   Beth 4
   Dan 3
   Kathy 5
   Mark 4
   Mary 4
   Susie 5
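
Two more of the text functions are sketched below; they are standard awk built-ins but are not discussed in these notes. substr(s, m, n) returns the n-character substring of s that starts at position m (positions count from 1), and index(s, t) returns the position of the first occurrence of t in s, or 0 if t does not occur.

```shell
# Recreate emp.data from section 1.1 in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/emp.data" <<'EOF'
Beth  4.00  0
Dan   3.75  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18
EOF

# For each name: the name, its first three characters, and the position of the first "a".
out=$(awk '{ print $1, substr($1, 1, 3), index($1, "a") }' "$dir/emp.data")
echo "$out"
# Beth Bet 0
# Dan Dan 2
# Kathy Kat 2
# Mark Mar 2
# Mary Mar 2
# Susie Sus 0
```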
  

Counting Lines, Words, and Characters

This program uses length, NF, and NR to count the number of lines, words, and characters in the input. For convenience, we'll treat each field as a word.

           { nc = nc + length($0) + 1
             nw = nw + NF }
   END     { print NR, "lines,", nw, "words,", nc, "characters" }

The file emp.data has

   6 lines, 18 words, 77 characters

We have added one for the newline character at the end of each input line, since $0 doesn't include it.
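
The character total depends on how the fields are separated in the file. The sketch below assumes that each pair of fields is separated by a single tab character, an assumption of this example under which the total comes to 77, and checks the character count against the standard wc utility.

```shell
# Assumption: emp.data with exactly one tab between fields.
dir=$(mktemp -d)
printf 'Beth\t4.00\t0\nDan\t3.75\t0\nKathy\t4.00\t10\nMark\t5.00\t20\nMary\t5.50\t22\nSusie\t4.25\t18\n' > "$dir/emp.data"

out=$(awk '
    { nc = nc + length($0) + 1    # +1 for the newline that $0 omits
      nw = nw + NF }              # each field counts as one word
END { print NR, "lines,", nw, "words,", nc, "characters" }
' "$dir/emp.data")
echo "$out"
# 6 lines, 18 words, 77 characters

# wc -c reports the size in bytes, which equals the character count for this ASCII file.
bytes=$(wc -c < "$dir/emp.data" | tr -d ' ')
echo "$bytes"
# 77
```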



1.6 Control-Flow Statements

Awk provides statements for making decisions and writing loops, mostly modeled on those found in the C programming language. They can only be used in actions.

If-Else Statement

The following program computes the total and average pay of employees making more than $6.00 an hour. It uses an if to defend against division by zero in computing the average pay.

   $2 > 6  { n = n + 1; pay = pay + $2 * $3 }
   END     { if (n > 0)
                print n, "employees, total pay is", pay,
                         "average pay is", pay/n
             else
                print "no employees are paid more than $6/hour"
           }

In the if-else statement, the condition following the if is evaluated. If it is true, the first print statement is performed. Otherwise, the second print statement is performed. Note that we can continue a long statement over several lines by breaking it after a comma.
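
The loop statements follow C as well. As an illustration that goes slightly beyond the notes, the sketch below uses the C-style for statement to add up all the fields of each input line; the input numbers and the variable names sum and i are invented for the example.

```shell
out=$(printf '1 2 3\n4 5\n' | awk '{
    sum = 0                         # reset the total for every input line
    for (i = 1; i <= NF; i++)       # visit fields $1 through $NF
        sum = sum + $i
    print NF, "fields, sum", sum
}')
echo "$out"
# 3 fields, sum 6
# 2 fields, sum 9
```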

1.7 Arrays

Awk provides arrays for storing groups of related values. Unlike in many earlier languages, array subscripts in awk are strings of characters. This gives awk a capability like the associative memory of SNOBOL4 tables, and for this reason, arrays in awk are called associative arrays.

Suppose some employees get a raise and work at two different pay rates:

   Beth  4.00  0
   Dan   3.75  0
   Kathy 4.00  10
   Mark  5.00  20
   Mary  5.50  22
   Susie 4.25  18
   Kathy 5.25  25
   Susie 5.75  12

You now want to compute and print, for each employee who worked, the employee's total number of hours and total pay. The following awk program does the job:

  
   $3 > 0  { hours[$1] += $3; pay[$1] += $2 * $3 }
   END     { for (emp in hours)
                printf("%s worked %d hours and received $%.2f\n",\
                       emp, hours[emp], pay[emp])
           }
  
The first action uses two arrays, hours and pay, to accumulate for each employee his or her total hours and pay. The name in the first field ($1) is used to index each array. As in C, an assignment statement of the form x += y is a shorthand for x = x + y. When an array entry is first created, it is automatically initialized to zero if it is used to hold numeric data.

The END action uses a form of the for statement that loops over all subscripts that were used to index the array. The loop executes the printf statement with the variable emp set in turn to each different subscript in the array. The order in which the subscripts are considered is implementation dependent. Note that the long printf statement has been broken by a backslash at the end of the line. One possible output from executing this program on the new employee data is

  
   Mary worked 22 hours and received $121.00
   Kathy worked 35 hours and received $171.25
   Susie worked 30 hours and received $145.50
   Mark worked 20 hours and received $100.00
  

We could pipe the output into sort as we did above to sort the output by employee name.
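
A sketch of that pipeline follows. It reruns the array program on the new employee data and sends the result through sort, so that the output order no longer depends on how the awk implementation walks the array. The scratch directory and LC_ALL=C are assumptions of this example, and the printf call is broken after a comma instead of with a backslash.

```shell
# The extended employee data from this section, in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/emp.data" <<'EOF'
Beth  4.00  0
Dan   3.75  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18
Kathy 5.25  25
Susie 5.75  12
EOF

out=$(awk '
$3 > 0 { hours[$1] += $3; pay[$1] += $2 * $3 }    # accumulate per employee name
END    { for (emp in hours)                       # order of subscripts is unspecified
             printf("%s worked %d hours and received $%.2f\n",
                    emp, hours[emp], pay[emp]) }
' "$dir/emp.data" | LC_ALL=C sort)
echo "$out"
# Kathy worked 35 hours and received $171.25
# Mark worked 20 hours and received $100.00
# Mary worked 22 hours and received $121.00
# Susie worked 30 hours and received $145.50
```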



1.8 A Handful of Useful "One-liners"

Although awk can be used to write programs of some complexity, many useful programs are not much more complicated than what we've seen so far. Here is a collection of short programs that you might find handy and/or instructive. Most are variations on material already covered.

Print the total number of input lines:

   END { print NR }

Print the tenth input line:

   NR == 10

Print the last field of every input line:

   { print $NF }

Print the last field of the last input line:

       { field = $NF }
   END { print field }

Print each distinct input line once, the first time it appears:

   !a[$0]++
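
The last one-liner is terse enough to deserve a closer look. The array entry indexed by the whole line counts how often that line has been seen. The ++ operator returns the old count and then increments it, so the pattern is true (NOT zero) only on the first occurrence, and a pattern with no action prints the line. The sketch below compares it with a longer program that is meant to be equivalent; the array name seen and the input lines are chosen for this example.

```shell
# Five input lines, two of them repeats.
input=$(printf '%s\n' a b a c b)

# The one-liner, with the array renamed from a to seen.
short=$(printf '%s\n' "$input" | awk '!seen[$0]++')

# The same logic written out as an explicit action.
long=$(printf '%s\n' "$input" | awk '{
    if (seen[$0] == 0)             # first time this line appears
        print
    seen[$0] = seen[$0] + 1        # remember that it has been seen
}')

echo "$short"
# a
# b
# c
```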
  


  

1.9 Summary

You have now seen the essentials of awk. Each program in this chapter has been a sequence of pattern-action statements. Awk tests every input line against the patterns, and when a pattern matches, performs the corresponding action. Patterns can involve numeric and string comparisons, and actions can include computation and formatted printing. Besides reading through your input files automatically, awk splits each input line into fields. It also provides a number of built-in variables and functions, and lets you define your own as well. With this combination of features, quite a few useful computations can be expressed by short programs -- many of the details that would be needed in another language are handled implicitly in an awk program.

For more information, consult the man pages for awk and the book The AWK Programming Language by Aho, Kernighan, and Weinberger.



2. Language Processing Tools

Basic compiler
Interpreter
Bytecode interpreter
Just-in-time compiler
Linker and loader
Preprocessor
Compiler component generators
   lex
   yacc
   antlr



3. Structure of a Compiler

Front end: analysis
Back end: synthesis
IR: Intermediate representation(s)
Phases
   lexical analyzer (scanner)
   syntax analyzer (parser)
   semantic analyzer
   intermediate code generator
   code optimizer
   code generator
   machine-specific code optimizer
Symbol table
Error handler




4. Reading

ALSU: Chapter 1
awk man pages
Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger, The AWK Programming Language, Addison-Wesley, 1988.





aho@cs.columbia.edu
