# COMS W3261 Computer Science Theory, Lecture 7 (September 26, 2012): Context-Free Grammars

## Outline

• Review
• Definition of a context-free grammar
• Derivations
• Leftmost and rightmost derivations
• Parse trees
• Ambiguity

## 1. Review

• Closure properties of regular languages
• Decision problems for regular languages
• Testing equivalence of states
• Testing equivalence of DFA's
• Minimizing the number of states in a DFA

## 2. Definition of a Context-Free Grammar (CFG)

• A CFG is a formalism for defining a language.
• A CFG has four components (V, T, P, S):
• V is a finite set of variables called nonterminals, sometimes called syntactic categories.
• Each variable represents a language.
• T is a finite set of symbols called terminals.
• The set of terminals is the alphabet of the language defined by the grammar.
• P is a finite set of productions, rewrite rules of the form `A → α`, where A is a nonterminal (the head of the production) and α is a string (possibly empty) of nonterminals and terminals (the body).
• S is a nonterminal, called the start symbol.
• Example grammar G1:
1. V = { `S` }
2. T = { ( , ) }
3. P is the set with the two productions
```
S → S ( S )
S → ε
```
4. S is the start symbol.
G1 generates the language consisting of all strings of balanced parentheses.
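
To make the four components concrete, here is a minimal sketch of G1 as a Python value, together with a check that the tuple satisfies the definition above. The names `CFG`, `is_well_formed`, and the field names are illustrative assumptions, not notation from the lecture; a production body is stored as a tuple of symbols, so ε is the empty tuple.

```python
from typing import NamedTuple


class CFG(NamedTuple):
    variables: frozenset    # V: the nonterminals
    terminals: frozenset    # T: the terminals
    productions: tuple      # P: (head, body) pairs; body is a tuple of symbols
    start: str              # S: the start symbol


def is_well_formed(g: CFG) -> bool:
    """Check the conditions in the definition of a CFG (hypothetical helper)."""
    if g.variables & g.terminals:        # V and T must be disjoint
        return False
    if g.start not in g.variables:       # S must be a nonterminal
        return False
    symbols = g.variables | g.terminals
    # Every head is a nonterminal; every body is a string over V ∪ T.
    return all(
        head in g.variables and all(x in symbols for x in body)
        for head, body in g.productions
    )


G1 = CFG(
    variables=frozenset({"S"}),
    terminals=frozenset({"(", ")"}),
    productions=(
        ("S", ("S", "(", "S", ")")),     # S → S ( S )
        ("S", ()),                       # S → ε
    ),
    start="S",
)

print(is_well_formed(G1))  # True
```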

## 3. Derivations

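• A grammar defines a language through derivations: starting from the start symbol, we repeatedly replace a nonterminal by the body of one of its productions until only terminals remain.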
• Example of a derivation of `( )( )` from `S` in G1:
```
S ⇒ S ( S )
  ⇒ S ( S ) ( S )
  ⇒ ( S ) ( S )
  ⇒ ( ) ( S )
  ⇒ ( ) ( )
```
• This derivation shows that `( )( )` is a string in the language defined by G1.
• L(G), the set of all strings of terminals that can be derived from the start symbol of a grammar G, is the language defined by G.
• We often call a string in L(G) a sentence of L(G).
• A string of terminals and nonterminals that can be derived from the start symbol of a grammar is called a sentential form.

## 4. Leftmost and Rightmost Derivations

• A derivation in which at each step we replace the leftmost nonterminal by one of its production bodies is called a leftmost derivation.
• The derivation above is a leftmost derivation of `( )( )` from `S` in G1.
• A rightmost derivation is one in which at each step we replace the rightmost nonterminal by one of its production bodies.
• Here is a rightmost derivation of `( )( )` from `S` in G1:
```
S ⇒ S ( S )
  ⇒ S ( )
  ⇒ S ( S ) ( )
  ⇒ S ( ) ( )
  ⇒ ( ) ( )
```
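
The two derivations of `( )( )` differ only in which nonterminal is rewritten at each step. As a rough illustration, the sketch below replays a sequence of production choices for G1, always rewriting either the leftmost or the rightmost `S`. The production numbering (0 for `S → S ( S )`, 1 for `S → ε`) and the names `BODIES` and `derive` are assumptions made for this example.

```python
# Production bodies of G1: index 0 is S → S ( S ), index 1 is S → ε.
BODIES = [("S", "(", "S", ")"), ()]


def derive(choices, leftmost=True):
    """Apply the chosen productions in order and return every sentential form."""
    form = ("S",)
    steps = [form]
    for k in choices:
        positions = [i for i, x in enumerate(form) if x == "S"]
        if not positions:
            raise ValueError("no nonterminal left to rewrite")
        i = positions[0] if leftmost else positions[-1]
        form = form[:i] + BODIES[k] + form[i + 1:]
        steps.append(form)
    return [" ".join(f) for f in steps]


# Leftmost derivation of ( )( ) from Section 3.
print(derive([0, 0, 1, 1, 1]))
# ['S', 'S ( S )', 'S ( S ) ( S )', '( S ) ( S )', '( ) ( S )', '( ) ( )']

# Rightmost derivation of the same string.
print(derive([0, 1, 0, 1, 1], leftmost=False))
# ['S', 'S ( S )', 'S ( )', 'S ( S ) ( )', 'S ( ) ( )', '( ) ( )']
```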

## 5. Parse Trees

• A derivation can be represented by a parse tree.
• Let G = (V, T, P, S) be a CFG. A parse tree for G is a tree in which:
• Each interior node is labeled by a nonterminal in V.
• Each leaf is labeled by a nonterminal, a terminal, or ε. A leaf labeled ε must be the only child of its parent.
• If an interior node is labeled by a nonterminal A and its children are labeled X1, X2, ... , Xk, then A → X1X2 ... Xk is a production in P.
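• The yield of a parse tree is the string obtained by concatenating the labels of the leaves from left to right.
• Derivations, leftmost derivations, rightmost derivations, parse trees, and recursive inference are equivalent in the following sense: a string of terminals w has a derivation from a nonterminal A exactly when it has a leftmost derivation from A, a rightmost derivation from A, a parse tree with root A and yield w, and a recursive inference showing that w is in the language of A.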
• A parser for a grammar G is a program that takes as input a string and produces as output a parse tree for the string or a message saying that the string cannot be generated by G.
• A parser generator is a program that takes as input a grammar G and produces as output a parser for G. YACC is a widely used parser generator.
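
The parse tree conditions and the yield can be written down directly. The following is a minimal sketch, assuming a tree is built by hand from a hypothetical `Node` class; `tree_yield` and `is_parse_tree` are illustrative names, and the tree shown is the parse tree of `( )( )` in G1.

```python
from dataclasses import dataclass, field

EPS = "ε"


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def tree_yield(node):
    """Concatenate the leaf labels from left to right; ε contributes nothing."""
    if not node.children:
        return "" if node.label == EPS else node.label
    return "".join(tree_yield(c) for c in node.children)


def is_parse_tree(node, productions):
    """Check that every interior node and its children match a production."""
    if not node.children:
        return True                      # leaves need no production
    body = tuple(c.label for c in node.children)
    if body == (EPS,):
        body = ()                        # a single ε child stands for an empty body
    return (node.label, body) in productions and all(
        is_parse_tree(c, productions) for c in node.children
    )


G1_PRODUCTIONS = {("S", ("S", "(", "S", ")")), ("S", ())}


def empty_s():
    """Subtree for S → ε."""
    return Node("S", [Node(EPS)])


inner = Node("S", [empty_s(), Node("("), empty_s(), Node(")")])
tree = Node("S", [inner, Node("("), empty_s(), Node(")")])

print(tree_yield(tree))                     # ()()
print(is_parse_tree(tree, G1_PRODUCTIONS))  # True
```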

## 6. Ambiguity

• A grammar G is ambiguous if there is a sentence in L(G) with two or more distinct parse trees.
• The following grammar G2 for arithmetic expressions is ambiguous because `a + a * a` has two parse trees: one groups the string as `a + (a * a)`, the other as `(a + a) * a`.
• `E → E + E | E * E | ( E ) | a`
• We can remove the ambiguity by specifying the associativity and precedence of the operators `+` and `*`.
• The grammar G3 below is unambiguous; it gives `*` higher precedence than `+` and makes both operators left associative.
```
E → E + T | T
T → T * F | F
F → ( E ) | a
```
• A context-free language L is inherently ambiguous if it cannot be generated by an unambiguous grammar.
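
The difference between G2 and G3 can be checked by counting parse trees. The sketch below is one possible way to do so, not a general parser: it counts, by memoized recursion over substrings, the parse trees with a given root and yield. It assumes single-character terminals, no ε-productions, and no cycles of unit productions, all of which hold for G2 and G3. The function name `count_parse_trees` and the dictionary encoding of the grammars are illustrative.

```python
from functools import lru_cache


def count_parse_trees(grammar, start, s):
    """Count parse trees with root `start` and yield `s`.

    grammar maps each nonterminal to a list of bodies (tuples of symbols).
    Assumes no ε-productions and no unit-production cycles.
    """

    @lru_cache(maxsize=None)
    def count(sym, i, j):
        # Number of parse trees with root sym and yield s[i:j].
        if sym not in grammar:                       # terminal symbol
            return 1 if j == i + 1 and s[i] == sym else 0
        return sum(count_body(body, i, j) for body in grammar[sym])

    @lru_cache(maxsize=None)
    def count_body(body, i, j):
        # Ways for the symbols of body to derive s[i:j], each a nonempty piece.
        if len(body) == 1:
            return count(body[0], i, j)
        total = 0
        # The first symbol derives s[i:k]; leave one character per remaining symbol.
        for k in range(i + 1, j - len(body) + 2):
            left = count(body[0], i, k)
            if left:
                total += left * count_body(body[1:], k, j)
        return total

    return count(start, 0, len(s))


G2 = {"E": [("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")"), ("a",)]}
G3 = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("a",)],
}

print(count_parse_trees(G2, "E", "a+a*a"))  # 2
print(count_parse_trees(G3, "E", "a+a*a"))  # 1
```

A count greater than 1 for some string shows that a grammar is ambiguous; a count of 1 for the strings tried is only evidence, not a proof, that a grammar is unambiguous.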

## 7. Practice Problems

1. Construct a CFG that generates the language { `a^n b^n` | n ≥ 0 }.
2. Prove that the language generated by the grammar G1 in section 2 consists of all and only the strings of balanced parentheses.
3. Construct a CFG that generates ELP = { `ww^R` | `w` is any string of `a`'s and `b`'s }, where `w^R` denotes the reverse of `w`. This is the language of even-length palindromes over the alphabet {`a`, `b`}. A palindrome is a string that reads the same in both directions.
4. Prove that ELP is not a regular language.
5. Construct a CFG for all regular expressions over the alphabet {a, b}.