COMS W3261
Computer Science Theory
Lecture 2: September 14, 2009
Regular Expressions
1. Outline
- Review
- Two inductive proofs
- Regular expressions
- Examples of regular expressions
- Practice problems
2. Review
- Operations on languages
- Union
- L ∪ M = { w | w is in either L
or M }
- Concatenation
- LM = { xy | x is in L
and y is in M }
- Exponentiation
- L^0 is { ε }, the set containing the empty string.
- L^i = L^{i-1}L, for i > 0
- Example: {0, 1}* is the set of all strings of 0's and 1's (including ε).
- Kleene closure
- L* = L^0 ∪ L^1 ∪ L^2 ∪ ...
- Note that ∅* = ∅^0 = { ε }.
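The operations reviewed above can be tried out on small finite languages. The following is a minimal Python sketch, assuming languages are represented as Python sets of strings and ε as the empty string "". The function names (union, concat, power, star_up_to) are illustrative choices, not part of the lecture. Because L* is usually infinite, star_up_to computes only the finite approximation L^0 ∪ L^1 ∪ ... ∪ L^k.

```python
def union(L, M):
    """L ∪ M: strings that are in either L or M."""
    return L | M

def concat(L, M):
    """LM: every x in L followed by every y in M."""
    return {x + y for x in L for y in M}

def power(L, i):
    """L^i, using L^0 = {ε} and L^i = L^{i-1}L for i > 0."""
    result = {""}              # "" plays the role of ε
    for _ in range(i):
        result = concat(result, L)
    return result

def star_up_to(L, k):
    """Finite approximation of L*: L^0 ∪ L^1 ∪ ... ∪ L^k."""
    result = set()
    for i in range(k + 1):
        result |= power(L, i)
    return result

print(sorted(concat({"a", "ab"}, {"b", ""})))   # ['a', 'ab', 'abb']
print(len(power({"0", "1"}, 3)))                # 8 strings of length 3
print(star_up_to(set(), 5))                     # {''}, matching ∅* = { ε }
```

Note that concat removes duplicates automatically: "ab" arises both as "a" + "b" and as "ab" + ε, but a set holds it once.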
3. Two inductive proofs
- Let x be a string of left and right parentheses.
The profile of x is the number of left parentheses in x minus
the number of right parentheses in x.
- Definition P. A string w of left and right parentheses is
profile-balanced if the profile of w is zero and
the profile of any prefix of w is nonnegative.
- Definition B. Balanced strings.
- B1: The empty string is balanced.
- B2: If x and y are balanced strings, then (x)y
is a balanced string.
- We want to show L(B) = L(P).
- We first show L(B) is contained in L(P)
by proving the following statement S(n) is true by induction
on n for n ≥ 0.
- S(n): If w is a string generated by n
applications of rule B2 in the definition B, then w is
profile-balanced.
- Basis: n = 0. The only string that can be generated by zero
applications of rule B2 is the empty string, which is clearly
profile-balanced.
- Induction: Suppose S(i) is true for i = 0, 1, 2,..., n.
Consider an instance of S(n + 1) which generates a string
w using n + 1 applications of rule B2.
- The last application of rule B2 that generates w
says w is of the form (x)y where x
and y are balanced.
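- Both x and y were generated by n or fewer applications
of rule B2, so by the inductive hypothesis both x and y
are profile-balanced. It follows that (x)y is profile-balanced:
its profile is 1 + 0 - 1 + 0 = 0, the profile of any prefix ending
inside (x is at least 1, and the profile of any prefix ending
inside y equals the profile of a prefix of y, which is nonnegative.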
- We therefore conclude S(n) is true for all n ≥ 0.
- We now show L(P) is contained in L(B)
by proving the following statement T(n) is true by induction
on n for n ≥ 0.
- T(n): If w is a profile-balanced string of length n,
then w is balanced.
- Basis: n = 0. Here, w is the empty string, which is
balanced by rule B1.
- Induction: Suppose all profile-balanced strings of length up to n
are balanced. Consider a profile-balanced string w of length
n + 1.
- Since w is nonempty, we can write w as (x)y where (x)
is the shortest nonempty prefix of w that is profile-balanced.
It follows that x and y must both be profile-balanced.
- By the inductive hypothesis both x and y
are balanced since each of their lengths is less than n.
By rule B2, (x)y is balanced.
- We therefore conclude T(n) is true for all n ≥ 0.
- We have now shown L(B) = L(P). Thus, we can use either
P or B to define what it means for a string of parentheses to
have balanced parentheses.
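The equivalence of definitions P and B can also be checked by brute force on short strings. The Python sketch below is only a sanity check under stated assumptions, not a substitute for the inductive proofs: the function names and the length bound MAX_LEN = 8 are illustrative choices. It also includes decompose, which performs the split w = (x)y used in the proof of T(n).

```python
from itertools import product

def profile(x):
    """Number of left parentheses in x minus number of right parentheses."""
    return x.count("(") - x.count(")")

def is_profile_balanced(w):
    """Definition P: profile of w is zero and no prefix has negative profile."""
    return profile(w) == 0 and all(profile(w[:i]) >= 0 for i in range(len(w) + 1))

def balanced_strings(max_len):
    """Definition B: all strings of length <= max_len obtainable from B1 and B2."""
    balanced = {""}                          # rule B1
    changed = True
    while changed:                           # apply rule B2 until nothing new fits
        changed = False
        for x in list(balanced):
            for y in list(balanced):
                w = "(" + x + ")" + y
                if len(w) <= max_len and w not in balanced:
                    balanced.add(w)
                    changed = True
    return balanced

def decompose(w):
    """Split w as (x)y; assumes w is nonempty and profile-balanced."""
    depth = 0
    for i, ch in enumerate(w):
        depth += 1 if ch == "(" else -1
        if depth == 0:                       # shortest nonempty balanced prefix ends here
            return w[1:i], w[i + 1:]
    raise ValueError("w must be nonempty and profile-balanced")

MAX_LEN = 8
all_strings = ("".join(p) for n in range(MAX_LEN + 1)
               for p in product("()", repeat=n))
by_P = {w for w in all_strings if is_profile_balanced(w)}
by_B = balanced_strings(MAX_LEN)

print(by_P == by_B)           # True
print(decompose("(())()"))    # ('()', '()')
```

For strings of length at most 8, both definitions yield the same 23 strings.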
4. Regular Expressions
- A regular expression E is an algebraic expression that denotes a language
L(E).
- Programming languages such as AWK, Java,
JavaScript, Perl, and Python use regular expressions to match patterns in strings.
- Regular expressions come in many different forms but almost all have the operations
of union, concatenation, and Kleene closure.
- Inductive definition of regular expressions over an
alphabet Σ:
- Basis
- The constants ε and ∅ are regular expressions that denote
the languages { ε } and { }, respectively.
- A symbol c in Σ by itself is a regular expression that denotes the
language { c }.
- Induction: Let E and F be regular expressions.
- E + F is
a regular expression that denotes L(E) ∪ L(F).
- EF is
a regular expression that denotes L(E)L(F),
the concatenation of L(E) and L(F).
- E* is
a regular expression that denotes (L(E))*.
- (E) is
a regular expression that denotes L(E).
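The inductive definition above translates directly into a recursive data type with one case per rule. The following Python sketch is one possible encoding, assuming hypothetical class names (Empty, Eps, Sym, Union, Concat, Star) that do not come from the lecture. Parentheses need no class of their own because grouping is already explicit in the tree. Since L(E) may be infinite, language(e, max_len) returns only the strings of L(e) of length at most max_len.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Empty:            # the constant ∅
    pass

@dataclass(frozen=True)
class Eps:              # the constant ε
    pass

@dataclass(frozen=True)
class Sym:              # a single symbol c in Σ
    c: str

@dataclass(frozen=True)
class Union:            # E + F
    left: object
    right: object

@dataclass(frozen=True)
class Concat:           # EF
    left: object
    right: object

@dataclass(frozen=True)
class Star:             # E*
    inner: object

def language(e, max_len):
    """Return the strings of L(e) whose length is at most max_len."""
    if isinstance(e, Empty):
        return set()
    if isinstance(e, Eps):
        return {""}
    if isinstance(e, Sym):
        return {e.c} if len(e.c) <= max_len else set()
    if isinstance(e, Union):
        return language(e.left, max_len) | language(e.right, max_len)
    if isinstance(e, Concat):
        return {x + y
                for x in language(e.left, max_len)
                for y in language(e.right, max_len)
                if len(x + y) <= max_len}
    if isinstance(e, Star):
        base = language(e.inner, max_len)
        result = {""}                        # L^0 = { ε }
        frontier = {""}
        while frontier:                      # append one more factor each round
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {e!r}")

# 0*10* written as a tree: ((0*)1)(0*)
one_one = Concat(Concat(Star(Sym("0")), Sym("1")), Star(Sym("0")))
print(sorted(language(one_one, 3), key=lambda s: (len(s), s)))
# ['1', '01', '10', '001', '010', '100']
print(language(Star(Empty()), 3))            # {''}, since ∅* = { ε }
```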
- Precedence and associativity of the regular-expression operators
- The regular-expression operator star has the highest precedence and is
left associative.
- The regular-expression operator concatenation has the next highest precedence and is
left associative.
- The regular-expression operator + has the lowest precedence and is
left associative.
- Thus the regular expression a + b*c would be grouped a + ((b*)c).
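The precedence rules can be observed in a programming language's regular expression library. The sketch below uses Python's re module, which writes union as | rather than +; the helper name matches and the length bound 4 are illustrative assumptions. It compares a + b*c against its fully parenthesized form and against a different grouping, over all strings of a's, b's, and c's up to length 4.

```python
import re
from itertools import product

def matches(pattern, max_len, alphabet="abc"):
    """Strings over alphabet, of length <= max_len, matched in full by pattern."""
    compiled = re.compile(pattern)
    return {"".join(p)
            for n in range(max_len + 1)
            for p in product(alphabet, repeat=n)
            if compiled.fullmatch("".join(p))}

default_grouping = matches(r"a|b*c", 4)        # a + b*c
explicit_grouping = matches(r"a|((b*)c)", 4)   # a + ((b*)c)
other_grouping = matches(r"(a|b)*c", 4)        # (a + b)*c, a different expression

print(default_grouping == explicit_grouping)   # True
print(default_grouping == other_grouping)      # False
print(sorted(default_grouping, key=lambda s: (len(s), s)))
# ['a', 'c', 'bc', 'bbc', 'bbbc']
```

For example, ac is matched by (a + b)*c but not by a + b*c.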
5. Examples of Regular Expressions and the Languages They Denote
- 0*10* denotes the set of all strings of 0's and 1's containing exactly one 1.
- (0+1)*1(0+1)* denotes the set of all strings of 0's and 1's containing at
least one 1.
- (a+b)*abba(a+b)* denotes the set of all strings of a's and b's
containing the substring abba.
- Let R? be a shorthand for the regular expression (R + ε).
- We can interpret the operator ? as meaning "zero or one of".
- The regular expression a?b?c? denotes the language
{ε, a, b, c, ab, ac, bc, abc}.
- The Unix command
egrep '^a?b?c?d?e?$' file
would print all lines in file consisting of the letters
a, b, c, d, e in increasing alphabetic order.
- The metacharacters ^ and $ match the
empty string at the beginning and end of a line, respectively.
- aegilops is the longest English word whose letters are in increasing
alphabetic order.
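The examples in this section can be checked against their English descriptions by exhaustive testing on short strings. The Python sketch below assumes Python's re syntax, in which union is written | instead of +; the helper strings, the length bound 8, and the sample lines are illustrative assumptions. The last loop imitates the egrep command on a few made-up input lines.

```python
import re
from itertools import product

def strings(alphabet, max_len):
    """Yield every string over alphabet of length at most max_len."""
    for n in range(max_len + 1):
        for p in product(alphabet, repeat=n):
            yield "".join(p)

# (pattern, alphabet, description of the language as a Python predicate)
checks = [
    (r"0*10*", "01", lambda w: w.count("1") == 1),
    (r"(0|1)*1(0|1)*", "01", lambda w: "1" in w),
    (r"(a|b)*abba(a|b)*", "ab", lambda w: "abba" in w),
]
results = {}
for pattern, alphabet, predicate in checks:
    results[pattern] = all(bool(re.fullmatch(pattern, w)) == predicate(w)
                           for w in strings(alphabet, 8))
    print(pattern, results[pattern])           # each line ends with True

# The language of a?b?c?
optional = sorted((w for w in strings("abc", 3) if re.fullmatch(r"a?b?c?", w)),
                  key=lambda s: (len(s), s))
print(optional)   # ['', 'a', 'b', 'c', 'ab', 'ac', 'bc', 'abc']

# Analogue of: egrep '^a?b?c?d?e?$' file
line_pattern = re.compile(r"^a?b?c?d?e?$")
sample_lines = ["ace", "bd", "eda", "abcde", "aab"]
printed = [line for line in sample_lines if line_pattern.search(line)]
print(printed)    # ['ace', 'bd', 'abcde']
```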
6. Practice Problems
- Do the two regular expressions (a+b)* and (a*b*)* denote the same language?
- Write a regular expression for all strings of a's and b's with an
even number of a's.
- Write a regular expression for all strings of a's and b's with an
even number of a's and an odd number of b's.
- Write a regular expression for all strings of a's and b's that do not
contain aba as a substring.
- Write a regular expression for all strings of a's, b's, and c's that do
not contain two identical adjacent characters.
- Write a Unix regular expression for all English words ending in dous.
- Write a Unix regular expression for all English words with the five vowels
a,e,i,o,u in order.
(The vowels do not have to be next to one another.)
7. Reading Assignment
aho@cs.columbia.edu