COMS W4115
Programming Languages and Translators
Lecture 6: Context-Free Languages
February 6, 2012
Lecture Outline
- Pumping lemma for regular languages
- Properties of regular languages
- Context-free grammars
- Derivations and parse trees
- Ambiguity
1. Pumping Lemma for Regular Languages
- The pumping lemma allows us to prove that certain languages, such as
{ a^n b^n | n ≥ 0 }, are not regular.
- The pumping lemma. Let L be a regular language. Then there exists a
constant n associated with L such that every string w in L with
|w| ≥ n can be partitioned into three strings
x, y, and z (i.e., w = xyz) such that
- y is not the empty string,
- the length of xy is less than or equal to n, and
- for all k ≥ 0, the string xy^k z is in L.
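The argument that { a^n b^n | n ≥ 0 } is not regular can be checked by brute force for one particular value of the constant. The sketch below is only an illustration: the names `in_language`, `splits`, and `find_pumpable_split` and the choice n = 4 are assumptions made for this example, and the actual proof must argue for an arbitrary n. The code tries every split w = xyz allowed by the lemma and looks for one whose pumped strings all stay in the language.

```python
def in_language(s: str) -> bool:
    """Return True if s has the form a^n b^n for some n >= 0."""
    half = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * half + "b" * half


def splits(w: str, n: int):
    """Yield every split w = xyz with y nonempty and |xy| <= n."""
    for i in range(0, n):              # i = |x|
        for j in range(i + 1, n + 1):  # j = |xy|
            yield w[:i], w[i:j], w[j:]


def find_pumpable_split(w: str, n: int, max_k: int = 3):
    """Return a split that survives pumping for k = 0..max_k, or None.

    A returned split is only a candidate (k is bounded), but a result of
    None is conclusive: every allowed split fails for some k.
    """
    for x, y, z in splits(w, n):
        if all(in_language(x + y * k + z) for k in range(max_k + 1)):
            return (x, y, z)
    return None


n = 4                  # hypothetical pumping constant
w = "a" * n + "b" * n  # a string in the language with |w| >= n
print(find_pumpable_split(w, n))  # prints None
```

Because |xy| ≤ n, the string y consists only of a's, so pumping changes the number of a's without changing the number of b's. That is why no split survives.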
2. Properties of Regular Languages
- The regular languages are closed under the operators of
- union
- intersection
- complement
- reversal
- Kleene star
- homomorphism
- inverse homomorphism
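Two of these closure properties can be illustrated with standard constructions on deterministic finite automata: swapping accepting and nonaccepting states for complement, and the product construction for intersection. The sketch below is one possible encoding, not the lecture's. The `DFA` class, its field names, and the two example automata are assumptions made for illustration, and the transition function is assumed to be total.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class DFA:
    states: frozenset
    alphabet: frozenset
    delta: dict          # total transition function: (state, symbol) -> state
    start: object
    accepting: frozenset

    def accepts(self, w: str) -> bool:
        q = self.start
        for c in w:
            q = self.delta[(q, c)]
        return q in self.accepting


def complement(a: DFA) -> DFA:
    """Accept exactly the strings that a rejects."""
    return DFA(a.states, a.alphabet, a.delta, a.start, a.states - a.accepting)


def intersection(a: DFA, b: DFA) -> DFA:
    """Product construction: run a and b in lockstep on the same input."""
    states = frozenset(product(a.states, b.states))
    delta = {((p, q), c): (a.delta[(p, c)], b.delta[(q, c)])
             for (p, q) in states for c in a.alphabet}
    accepting = frozenset((p, q) for (p, q) in states
                          if p in a.accepting and q in b.accepting)
    return DFA(states, a.alphabet, delta, (a.start, b.start), accepting)


SIGMA = frozenset("ab")
# Hypothetical example: strings with an even number of a's.
even_as = DFA(frozenset({0, 1}), SIGMA,
              {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1},
              0, frozenset({0}))
# Hypothetical example: strings that end in b.
ends_b = DFA(frozenset({"n", "y"}), SIGMA,
             {("n", "a"): "n", ("y", "a"): "n", ("n", "b"): "y", ("y", "b"): "y"},
             "n", frozenset({"y"}))

both = intersection(even_as, ends_b)
print(both.accepts("aab"), both.accepts("ab"))  # prints True False
print(complement(even_as).accepts("ab"))        # prints True
```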
- Decision properties
- Given a regular expression r and a string w, it is decidable
whether r matches w.
- Given a finite automaton A, it is decidable whether L(A) is empty.
- Given two finite automata A and B, it is decidable whether L(A) = L(B).
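The last two decision procedures can be sketched as graph searches. Emptiness reduces to asking whether any accepting state is reachable from the start state. Equivalence can be decided by exploring pairs of states reached on the same input and checking that no reachable pair disagrees on acceptance. The code below is a minimal sketch under assumptions: each automaton is a deterministic, total DFA encoded as a hypothetical `(delta, start, accepting)` triple, and the function names are chosen for this example.

```python
from collections import deque


def is_empty(dfa, alphabet) -> bool:
    """L(A) is empty iff no accepting state is reachable from the start."""
    delta, start, accepting = dfa
    seen, queue = {start}, deque([start])
    while queue:
        q = queue.popleft()
        if q in accepting:
            return False
        for c in alphabet:
            r = delta[(q, c)]
            if r not in seen:
                seen.add(r)
                queue.append(r)
    return True


def equivalent(a, b, alphabet) -> bool:
    """L(A) = L(B) iff no reachable state pair disagrees on acceptance."""
    (da, sa, fa), (db, sb, fb) = a, b
    seen, queue = {(sa, sb)}, deque([(sa, sb)])
    while queue:
        p, q = queue.popleft()
        if (p in fa) != (q in fb):
            return False
        for c in alphabet:
            nxt = (da[(p, c)], db[(q, c)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


def counter(m):
    """Transitions of a DFA that counts a's modulo m and ignores b's."""
    return {(q, c): ((q + 1) % m if c == "a" else q)
            for q in range(m) for c in "ab"}


even2 = (counter(2), 0, {0})      # even number of a's, 2 states
even4 = (counter(4), 0, {0, 2})   # same language, 4 states
mult4 = (counter(4), 0, {0})      # number of a's divisible by 4

print(equivalent(even2, even4, "ab"))        # prints True
print(equivalent(even2, mult4, "ab"))        # prints False
print(is_empty((counter(2), 0, set()), "ab"))  # prints True
```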
3. Context-Free Grammars (CFG's)
- CFG's are very useful for representing the syntactic structure
of programming languages.
- The notation used to write a CFG is sometimes called Backus-Naur Form (BNF).
- A context-free grammar consists of
- A finite set of terminal symbols,
- A finite nonempty set of nonterminal symbols,
- One distinguished nonterminal called the start symbol, and
- A finite set of rewrite rules, called productions, each of the form
A → α
where A is a nonterminal and α is a string (possibly empty)
of terminals and nonterminals.
- Consider the context-free grammar G with the productions
E → E + T | T
T → T * F | F
F → ( E ) | id
- The terminal symbols are the alphabet from which strings are formed.
In this grammar the set of terminal symbols is
{ id, +, *, (, ) }. The terminal symbols are the token names.
- The nonterminal symbols are syntactic variables that denote sets
of strings of terminal symbols. In this grammar the set of nonterminal
symbols is { E, T, F }.
- The start symbol is E.
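The four components of this grammar can be written down as a small data structure. The encoding below is just one possible choice, made for illustration: the names `GRAMMAR` and `START` and the dictionary-of-tuples layout are assumptions, not part of the lecture. The sets of nonterminals and terminals are then computed from the productions.

```python
# Each nonterminal maps to a list of production bodies;
# each body is a tuple of grammar symbols.
GRAMMAR = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}
START = "E"

# Nonterminals are exactly the symbols that have productions.
nonterminals = set(GRAMMAR)

# Terminals are the remaining symbols that appear in production bodies.
terminals = {sym
             for bodies in GRAMMAR.values()
             for body in bodies
             for sym in body} - nonterminals

print(sorted(nonterminals))  # prints ['E', 'F', 'T']
print(sorted(terminals))     # prints ['(', ')', '*', '+', 'id']
```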
4. Derivations and Parse Trees
- L(G), the language generated by a grammar G, consists of all strings of
terminal symbols that can be derived from the start symbol of G.
- A leftmost derivation expands the leftmost nonterminal in
each sentential form:
E ⇒ E + T
⇒ T + T
⇒ F + T
⇒ id + T
⇒ id + T * F
⇒ id + F * F
⇒ id + id * F
⇒ id + id * id
- A rightmost derivation expands the rightmost nonterminal in each sentential form:
E ⇒ E + T
⇒ E + T * F
⇒ E + T * id
⇒ E + F * id
⇒ E + id * id
⇒ T + id * id
⇒ F + id * id
⇒ id + id * id
Note that these two derivations have the same parse tree.
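The relationship between one parse tree and its two derivations can be made concrete. The sketch below reads both derivations off a single tree for id + id * id. It is an illustration under assumptions: the tree encoding (a nonterminal node is a `(symbol, children)` tuple and a terminal leaf is a plain string) and the names `render`, `derivation`, and `TREE` are invented for this example.

```python
def render(form) -> str:
    """Show a sentential form, given as a list of subtrees, as a string."""
    return " ".join(t[0] if isinstance(t, tuple) else t for t in form)


def derivation(tree, leftmost: bool = True):
    """Return the sentential forms produced by expanding tree node by node."""
    form = [tree]
    steps = [render(form)]
    while True:
        # Positions of the nonterminals still present in the sentential form.
        pending = [i for i, t in enumerate(form) if isinstance(t, tuple)]
        if not pending:
            return steps
        i = pending[0] if leftmost else pending[-1]
        form[i:i + 1] = form[i][1]   # replace the node by its children
        steps.append(render(form))


def leaf(symbol):
    """Build the subtree symbol => ... => F => id for symbol in E, T, F."""
    node = ("F", ["id"])
    for parent in {"F": [], "T": ["T"], "E": ["T", "E"]}[symbol]:
        node = (parent, [node])
    return node


# Parse tree of id + id * id in the grammar E -> E + T | T, and so on.
TREE = ("E", [leaf("E"), "+", ("T", [leaf("T"), "*", leaf("F")])])

for line in derivation(TREE, leftmost=True):
    print(line)
print("---")
for line in derivation(TREE, leftmost=False):
    print(line)
```

Both loops start from E and end at id + id * id. They differ only in the order in which the nodes of the same tree are expanded.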
5. Ambiguity
- Consider the context-free grammar G with the productions
E → E + E | E * E | ( E ) | id
This grammar has the following leftmost derivation for
id + id * id:
E ⇒ E + E
⇒ id + E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
This grammar also has the following leftmost derivation for
id + id * id:
E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
These derivations have different parse trees.
A grammar is ambiguous if there is a sentence with two
or more parse trees.
The problem is that the grammar above does not specify
- the precedence of the + and * operators, or
- the associativity of the + and * operators.
However, the grammar in section (3) generates the same language
and is unambiguous because
it makes * of higher precedence than +, and makes both operators
left associative.
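The difference between the two grammars can be observed by counting parse trees for id + id * id. The sketch below is an illustration, not an ambiguity test (no such general algorithm exists). It counts the parse trees of one given token sequence and assumes a grammar with no ε-productions and no cycles of unit productions, which holds for both grammars here. The function name `count_parse_trees` and the grammar encoding are assumptions made for this example.

```python
from functools import lru_cache


def count_parse_trees(grammar, start, tokens) -> int:
    """Count the parse trees for tokens, assuming no ε-productions
    and no cycles of unit productions in grammar."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        """Number of parse trees deriving tokens[i:j] from symbol."""
        if symbol not in grammar:  # terminal: must match exactly one token
            return 1 if j == i + 1 and tokens[i] == symbol else 0
        return sum(count_seq(body, i, j) for body in grammar[symbol])

    @lru_cache(maxsize=None)
    def count_seq(body, i, j):
        """Number of ways the symbols of body together derive tokens[i:j]."""
        if len(body) == 1:
            return count(body[0], i, j)
        rest = len(body) - 1
        # Each symbol derives at least one token, so leave room for the rest.
        return sum(count(body[0], i, k) * count_seq(body[1:], k, j)
                   for k in range(i + 1, j - rest + 1))

    return count(start, 0, len(tokens))


AMBIGUOUS = {
    "E": [("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")"), ("id",)],
}
UNAMBIGUOUS = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}

sentence = "id + id * id".split()
print(count_parse_trees(AMBIGUOUS, "E", sentence))    # prints 2
print(count_parse_trees(UNAMBIGUOUS, "E", sentence))  # prints 1
```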
A context-free language is inherently ambiguous if it
cannot be generated by any unambiguous context-free grammar.
The context-free language
{ a^m b^m a^n b^n | m > 0 and n > 0 } ∪
{ a^m b^n a^n b^m | m > 0 and n > 0 }
is inherently ambiguous.
Most (all?) natural languages are inherently ambiguous but no
programming languages are inherently ambiguous.
Unfortunately, there is no algorithm to determine whether a CFG is ambiguous;
that is, the problem of determining whether a CFG is ambiguous is undecidable.
We can, however, give some practically useful sufficient conditions to guarantee that a CFG
is unambiguous.
6. Practice Problems
- Let G be the grammar
S → a S b S | b S a S | ε.
- What language is generated by this grammar?
- Draw all parse trees for the sentence
abab.
- Is this grammar ambiguous?
- Let G be the grammar
S → a S b | ε.
Prove that L(G) = { a^n b^n | n ≥ 0 }.
- Consider a sentence of the form
id + id + ... + id where there are
n plus signs. Let G be the grammar in section (5) above.
How many parse trees are there in G for this sentence when n equals
- 1
- 2
- 3
- 4
- m?
- Consider the grammar in section (5) above.
How many sentences does this grammar generate
having n left parentheses where n equals
- 1
- 2
- 3
- 4
- m?
7. Reading
- ALSU, Ch 3 (except Sect. 3.9), Sects. 4.1-4.2
aho@cs.columbia.edu