# COMS W3261 Computer Science Theory Lecture 2: September 10, 2012 Regular Expressions

## Outline

• Review
• Regular expressions
• Examples of regular expressions
• Practice problems

## 1. Review: Operations on Languages

• Union
• L ∪ M = { w | w is in either L or M }
• Concatenation
• LM = { xy | x is in L and y is in M }
• Exponentiation
• L⁰ is { ε }, the set containing the empty string.
• Lⁱ = Lⁱ⁻¹L, for i > 0
• Kleene closure
• L* = L⁰ ∪ L¹ ∪ L² ∪ ...
• Note that ∅* = ∅⁰ = { ε }.
• Example: {a, b}* is the set of all strings of a's and b's (including ε).
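
The operations above can be tried out on small finite languages. The following is a minimal sketch in Python, assuming languages are represented as sets of strings and the empty string `""` stands for ε. The function names (`union`, `concat`, `power`, `kleene_upto`) are illustrative choices, not part of the lecture. Since L* is usually infinite, `kleene_upto` only computes the finite union L⁰ ∪ L¹ ∪ ... ∪ Lᵏ.

```python
def union(L, M):
    """L ∪ M: strings that are in either L or M."""
    return L | M

def concat(L, M):
    """LM: every string xy with x in L and y in M."""
    return {x + y for x in L for y in M}

def power(L, i):
    """L^i, built from L^0 = { ε } and L^i = L^(i-1) L."""
    result = {""}                    # L^0 contains only the empty string
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_upto(L, k):
    """Finite approximation of L*: the union of L^0, L^1, ..., L^k."""
    result = set()
    for i in range(k + 1):
        result |= power(L, i)
    return result

# The closure of the empty language still contains ε, because ∅^0 = { ε }.
print(kleene_upto(set(), 3))
# All strings of a's and b's of length at most 2 (including ε).
print(sorted(kleene_upto({"a", "b"}, 2), key=lambda w: (len(w), w)))
```

Note that `power(set(), 0)` returns `{""}` while `power(set(), 1)` returns the empty set, which mirrors the observation that ∅* = ∅⁰ = { ε }.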

## 2. Regular Expressions

• A regular expression E is an algebraic expression that denotes a language L(E).
• Programming languages such as awk, Java, JavaScript, Perl, and Python use regular expressions to match patterns in strings.
• The regular-expression notations used by various programming languages differ; the most common variants are POSIX regular expressions and Perl-compatible regular expressions.
• Virtually all regular-expression notations have the operations of union, concatenation, and Kleene closure. We shall call regular expressions with just these three operators Kleene regular expressions.

### Kleene regular expressions

• Inductive definition of Kleene regular expressions over an alphabet Σ:
• Basis
• The constants ε and ∅ are regular expressions that denote the languages { ε } and { }, respectively.
• A symbol c in Σ by itself is a regular expression that denotes the language { c }.
• Induction: Let E and F be regular expressions.
• E + F is a regular expression that denotes L(E) ∪ L(F).
• EF is a regular expression that denotes L(E)L(F), the concatenation of L(E) and L(F).
• E* is a regular expression that denotes (L(E))*.
• (E) is a regular expression that denotes L(E).
• If a regular expression E denotes a language L and a string w is in L, we will often say that E matches w.
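
The inductive definition translates almost directly into a recursive function. The sketch below is one possible encoding in Python, not the lecture's own: the class names (`Empty`, `Eps`, `Sym`, `Plus`, `Concat`, `Star`) and the functions `lang` and `matches` are assumptions made for illustration. Because L(E) may be infinite, `lang(E, n)` returns only the strings of L(E) whose length is at most n. Parentheses need no class of their own, since the nesting of the objects already records the grouping.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Empty:            # the constant ∅, denoting { }
    pass

@dataclass(frozen=True)
class Eps:              # the constant ε, denoting { ε }
    pass

@dataclass(frozen=True)
class Sym:              # a symbol c, denoting { c }
    c: str

@dataclass(frozen=True)
class Plus:             # E + F, denoting L(E) ∪ L(F)
    e: object
    f: object

@dataclass(frozen=True)
class Concat:           # EF, denoting L(E)L(F)
    e: object
    f: object

@dataclass(frozen=True)
class Star:             # E*, denoting (L(E))*
    e: object

def lang(E, n):
    """Return the strings of L(E) that have length at most n."""
    if isinstance(E, Empty):
        return set()
    if isinstance(E, Eps):
        return {""}
    if isinstance(E, Sym):
        return {E.c} if n >= 1 else set()
    if isinstance(E, Plus):
        return lang(E.e, n) | lang(E.f, n)
    if isinstance(E, Concat):
        return {x + y for x in lang(E.e, n) for y in lang(E.f, n)
                if len(x + y) <= n}
    if isinstance(E, Star):
        base = {w for w in lang(E.e, n) if w}   # nonempty strings of L(E)
        result, frontier = {""}, {""}
        while frontier:                         # extend until nothing new fits
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {E!r}")

def matches(E, w):
    """E matches w exactly when w is in L(E)."""
    return w in lang(E, len(w))

# The expression 0*10*, written as nested objects.
one_1 = Concat(Star(Sym("0")), Concat(Sym("1"), Star(Sym("0"))))
print(sorted(lang(one_1, 2)))
print(matches(one_1, "0100"), matches(one_1, "0110"))
```

This enumeration is exponential in n and is meant only to make the definition concrete; it is not how practical regular-expression matchers work.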

• Precedence and associativity of the regular-expression operators
• The regular-expression operator star has the highest precedence and is left associative.
• The regular-expression operator concatenation has the next highest precedence and is left associative.
• The regular-expression operator + has the lowest precedence and is left associative.
• Thus the regular expression a + b*c would be grouped a + ((b*)c).
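
The grouping can be checked experimentally. The sketch below uses Python's `re` module, which writes union as `|` instead of `+` but, for these three operators, follows the same precedence order. The helper `language` is a hypothetical name introduced here; it lists every string over a small alphabet, up to a length bound, that a pattern matches in full.

```python
import re
from itertools import product

def language(pattern, alphabet="abc", max_len=4):
    """Strings over `alphabet` of length <= max_len matched in full by `pattern`."""
    rx = re.compile(pattern)
    return {
        "".join(t)
        for n in range(max_len + 1)
        for t in product(alphabet, repeat=n)
        if rx.fullmatch("".join(t))
    }

default = language(r"a|b*c")        # a + b*c with no explicit grouping
explicit = language(r"a|((b*)c)")   # the grouping implied by the precedence rules
other = language(r"(a|b*)c")        # a different grouping, for contrast

print(default == explicit)          # same language up to the length bound
print(sorted(default ^ other))      # strings on which the two groupings disagree
```

The bounded comparison is evidence rather than proof, since it only looks at strings of length at most `max_len`.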

### Examples of Kleene regular expressions and the languages they denote

• 0*10* denotes the set of all strings of 0's and 1's containing exactly one 1.
• (0+1)*1(0+1)* denotes the set of all strings of 0's and 1's containing at least one 1.
• (a+b)*abba(a+b)* denotes the set of all strings of a's and b's containing the substring abba.
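
Each of these descriptions can be tested by brute force on short strings. The sketch below, again using Python's `re` with `|` in place of `+`, compares every example against a direct Python predicate for the language it is said to denote (the first example is checked as "exactly one 1"). The names `strings`, `examples`, and `agrees` are illustrative, and agreement up to a length bound is a sanity check, not a proof.

```python
import re
from itertools import product

def strings(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for t in product(alphabet, repeat=n):
            yield "".join(t)

# (pattern in Python syntax, alphabet, predicate describing the language)
examples = [
    (r"0*10*",            "01", lambda w: w.count("1") == 1),
    (r"(0|1)*1(0|1)*",    "01", lambda w: "1" in w),
    (r"(a|b)*abba(a|b)*", "ab", lambda w: "abba" in w),
]

def agrees(pattern, alphabet, predicate, max_len=8):
    """True if pattern and predicate accept the same strings up to max_len."""
    rx = re.compile(pattern)
    return all(bool(rx.fullmatch(w)) == predicate(w)
               for w in strings(alphabet, max_len))

for pattern, alphabet, predicate in examples:
    print(pattern, agrees(pattern, alphabet, predicate))
```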

## 3. POSIX Regular Expressions

• The IEEE standards group POSIX added a number of additional operators to Kleene regular expressions to make it easier to specify languages. It also tried to standardize the different regular-expression conventions used by various Unix utilities.
• Here we list some of the more useful POSIX regular-expression operators and describe the strings they match.

### Some POSIX regular expression operators

1. Posix uses `?` to mean "zero or one instance of".
2. The regular expression `a?b?c?` denotes the language {`ε, a, b, c, ab, ac, bc, abc`}. Thus `a?b?c?` matches any of the eight strings in this language.
3. `.` matches any character except a newline.
4. `^` matches the empty string at the beginning of a line.
5. `$` matches the empty string at the end of a line.
6. `[abc]` matches an `a`, `b`, or `c`.
7. `[a-z]` matches any lowercase letter from `a` to `z`.
8. `[A-Za-z0-9]` matches any alphanumeric character.
9. `[^abc]` matches any character except an `a`, `b`, or `c`.
10. `[^0-9]` matches any nonnumeric character.
11. `a*` matches any string of zero or more `a`'s (including the empty string).
12. `a?` matches any string of zero or one `a`'s (including the empty string).
13. `a{2,5}` matches any string consisting of two to five `a`'s.
14. `(a)` matches an `a`.
15. Note that in POSIX regular expressions the operator `|` (rather than `+`) is used to denote union. In POSIX regular expressions `+` means "one or more instances of".
16. `\` is a metacharacter that turns off any special meaning of the following character. For example, `d\*g` matches the string `d*g`. As another example, `\\` matches the string consisting of the single character `\`.
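
The operators listed above can be exercised from Python. One caveat: Python's `re` module implements a Perl-style dialect rather than POSIX, so this is a sketch under the assumption that the simple patterns below behave the same way in both, which holds for the operators in this list as far as we know. The helper `full` is a hypothetical name; it reports whether a pattern matches an entire string.

```python
import re

def full(pattern, s):
    """True if `pattern` matches the whole string `s`."""
    return re.fullmatch(pattern, s) is not None

# a?b?c? matches exactly eight strings; "ba" and "aa" are rejected.
candidates = ["", "a", "b", "c", "ab", "ac", "bc", "abc", "ba", "aa"]
opt = [w for w in candidates if full(r"a?b?c?", w)]
print(opt)

checks = {
    "dot matches one character":      full(r".", "x"),
    "dot does not match newline":     not full(r".", "\n"),
    "bracket range":                  full(r"[a-z]", "q") and not full(r"[a-z]", "Q"),
    "negated bracket":                full(r"[^0-9]", "x") and not full(r"[^0-9]", "7"),
    "bounded repetition":             full(r"a{2,5}", "aaa") and not full(r"a{2,5}", "a"),
    "union with |":                   full(r"a|b", "b"),
    "plus means one or more":         full(r"a+", "aaa") and not full(r"a+", ""),
    "backslash escapes star":         full(r"d\*g", "d*g") and not full(r"d\*g", "dg"),
    "escaped backslash":              full(r"\\", "\\"),
    "caret anchors at the beginning": re.search(r"^bc", "abc") is None,
    "dollar anchors at the end":      re.search(r"bc$", "abc") is not None,
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```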

### Examples of POSIX regular expressions and the strings they match

• The Unix command `egrep 'regexp' file` prints all lines in `file` that contain a substring matched by the regular expression `regexp`. Examples:
1. The command `egrep 'dog' file` would print all lines in `file` containing the substring `dog`.
2. The command `egrep '^a?b?c?d?e?$' file` would print all lines in `file` that consist only of the letters `a, b, c, d, e`, each used at most once and in increasing alphabetic order (empty lines included). The metacharacters `^` and `$` match the empty string at the beginning and end of a line, respectively.
3. `aegilops` is the longest English word whose letters are in increasing alphabetic order.
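
The behavior of `egrep` in these examples can be imitated in a few lines of Python. This is a rough sketch, not a reimplementation: the function name `egrep` and the sample data are made up for illustration, input is a list of lines rather than a file, and Python's `re` dialect is assumed to agree with POSIX on these particular patterns. The last part builds the 26-letter analogue of `^a?b?c?d?e?$` to check that the letters of `aegilops` are in increasing alphabetic order.

```python
import re
import string

def egrep(pattern, lines):
    """Return the lines that contain a substring matched by `pattern`."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

lines = ["dog", "hotdog stand", "abe", "ace", "bad", "", "cat"]

print(egrep(r"dog", lines))             # lines containing the substring dog
print(egrep(r"^a?b?c?d?e?$", lines))    # letters a..e, increasing, each at most once

# ^a?b?c?...z?$ : every letter optional, in alphabetic order.
increasing = "^" + "".join(c + "?" for c in string.ascii_lowercase) + "$"
print(egrep(increasing, ["aegilops", "dog", "almost", "zebra"]))
```

Without the anchors `^` and `$`, the second pattern would match every line, because `a?b?c?d?e?` matches the empty string, which is a substring of any line.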

## 4. Practice Problems

1. Do the two regular expressions (a+b)* and (a*b*)* denote the same language?
2. Write a Kleene regular expression for all strings of a's and b's with an even number of a's.
3. Write a Kleene regular expression for all strings of a's and b's that begin and end with an a.
4. Write a POSIX regular expression that matches all English words ending in `dous`.
5. Write a POSIX regular expression that matches all English words with the five vowels a, e, i, o, u in order. (The vowels do not have to be next to one another.)

## 5. References

• HMU: Sects. 3.1, 3.3.1
• http://en.wikipedia.org/wiki/Regular_expression

aho@cs.columbia.edu