Computer Science Theory

Lecture 2: September 10, 2012

Regular Expressions

- Review
- Regular expressions
- Examples of regular expressions
- Practice problems

- Union
  - *L* ∪ *M* = { *w* | *w* is in either *L* or *M* }
- Concatenation
  - *LM* = { *xy* | *x* is in *L* and *y* is in *M* }
- Exponentiation
  - *L*^{0} is { ε }, the set containing the empty string.
  - *L*^{i} = *L*^{i-1}*L*, for *i* > 0
- Kleene closure
  - *L** = *L*^{0} ∪ *L*^{1} ∪ *L*^{2} ∪ ...
  - Note that ∅* = ∅^{0} = { ε }.
  - Example: {a, b}* is the set of all strings of a's and b's (including ε).
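
For finite languages, the operations reviewed above can be tried out directly. The following is a minimal Python sketch, not part of the original notes. The function names `union`, `concat`, `power`, and `closure_up_to` are illustrative, languages are modeled as Python sets of strings, and ε is the empty string `""`. Because the Kleene closure is usually infinite, the sketch only computes the union of the powers up to a chosen exponent.

```python
def union(L, M):
    """L ∪ M: strings that are in either L or M."""
    return set(L) | set(M)


def concat(L, M):
    """LM: every string xy with x in L and y in M."""
    return {x + y for x in L for y in M}


def power(L, i):
    """L^i, using L^0 = {ε} and L^i = L^(i-1) L for i > 0."""
    result = {""}  # L^0 is the set containing only the empty string
    for _ in range(i):
        result = concat(result, L)
    return result


def closure_up_to(L, n):
    """Finite approximation of L*: L^0 ∪ L^1 ∪ ... ∪ L^n."""
    result = set()
    for i in range(n + 1):
        result |= power(L, i)
    return result


if __name__ == "__main__":
    print(sorted(concat({"a", "ab"}, {"b", ""})))
    # The closure of the empty language still contains ε, via L^0.
    print(closure_up_to(set(), 3))
    print(sorted(closure_up_to({"a", "b"}, 2), key=lambda s: (len(s), s)))
```

Note that `power(set(), 0)` is `{""}` while `power(set(), 1)` is empty, which is why ∅* = { ε }.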

- A regular expression *E* is an algebraic expression that denotes a language *L*(*E*).
- Programming languages such as awk, Java, JavaScript, Perl, and Python use regular expressions to match patterns in strings.
- The regular-expression notations used by various programming languages differ, the most common variants being POSIX regular expressions and Perl-compatible regular expressions.
- Virtually all regular-expression notations have the operations of union, concatenation, and Kleene closure. We shall call regular expressions with just these three operators Kleene regular expressions.
- Inductive definition of Kleene regular expressions over an alphabet Σ:
  - Basis
    - The constants ε and ∅ are regular expressions that denote the languages { ε } and { }, respectively.
    - A symbol *c* in Σ by itself is a regular expression that denotes the language { *c* }.
  - Induction: Let *E* and *F* be regular expressions.
    - *E* + *F* is a regular expression that denotes *L*(*E*) ∪ *L*(*F*).
    - *EF* is a regular expression that denotes *L*(*E*)*L*(*F*), the concatenation of *L*(*E*) and *L*(*F*).
    - *E** is a regular expression that denotes (*L*(*E*))*.
    - (*E*) is a regular expression that denotes *L*(*E*).
- If a regular expression *E* denotes a language *L* and a string *w* is in *L*, we will often say that *E matches w*.
- Precedence and associativity of the regular-expression operators
  - The regular-expression operator star has the highest precedence and is left associative.
  - The regular-expression operator concatenation has the next highest precedence and is left associative.
  - The regular-expression operator + has the lowest precedence and is left associative.
  - Thus the regular expression a + b*c would be grouped a + ((b*)c).
- 0*10* denotes the set of all strings of 0's and 1's containing a single 1.
- (0+1)*1(0+1)* denotes the set of all strings of 0's and 1's containing at least one 1.
- (a+b)*abba(a+b)* denotes the set of all strings of a's and b's containing the substring abba.
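
The three examples above, and the grouping a + ((b*)c), can be checked by brute force over short strings. The sketch below is an illustration under stated assumptions, not part of the original notes. It uses Python's `re` module as a stand-in matcher, writes the Kleene union operator + as `|` (Python's notation), and uses `re.fullmatch` to test whether a whole string is in the language. The helper names `strings_up_to` and `check` are made up for this example. Agreement on all strings up to a bounded length is evidence, not a proof.

```python
import re
from itertools import product


def strings_up_to(alphabet, n):
    """Yield every string over `alphabet` of length 0..n."""
    for length in range(n + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def check(pattern, alphabet, predicate, n=8):
    """True if `pattern` matches exactly the strings satisfying `predicate`,
    for all strings over `alphabet` of length at most n."""
    regex = re.compile(pattern)
    return all(
        (regex.fullmatch(w) is not None) == predicate(w)
        for w in strings_up_to(alphabet, n)
    )


if __name__ == "__main__":
    # 0*10*: strings containing a single 1
    print(check(r"0*10*", "01", lambda w: w.count("1") == 1))
    # (0+1)*1(0+1)*: strings containing at least one 1
    print(check(r"(0|1)*1(0|1)*", "01", lambda w: "1" in w))
    # (a+b)*abba(a+b)*: strings containing the substring abba
    print(check(r"(a|b)*abba(a|b)*", "ab", lambda w: "abba" in w))
    # Precedence: compare a + b*c with the explicit grouping a + ((b*)c)
    grouped = re.compile(r"a|((b*)c)")
    print(check(r"a|b*c", "abc", lambda w: grouped.fullmatch(w) is not None, n=5))
```

Replacing the grouped pattern with `(a|b)*c` makes the last check fail, since the string a is matched by a + b*c but not by (a + b)*c.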

- The IEEE standards group POSIX added a number of additional operators to Kleene regular expressions to make it easier to specify languages. It also tried to standardize the different regular-expression conventions used by various Unix utilities.
- Here we list some of the more useful POSIX regular-expression operators and describe the strings they match.
- POSIX uses `?` to mean "zero or one instance of".
  - The regular expression `a?b?c?` denotes the language {`ε, a, b, c, ab, ac, bc, abc`}. Thus `a?b?c?` matches any of the eight strings in this language.
- `.` matches any character except a newline.
- `^` matches the empty string at the beginning of a line.
- `$` matches the empty string at the end of a line.
- `[abc]` matches an `a`, `b`, or `c`.
- `[a-z]` matches any lowercase letter from `a` to `z`.
- `[A-Za-z0-9]` matches any alphanumeric character.
- `[^abc]` matches any character except an `a`, `b`, or `c`.
- `[^0-9]` matches any nonnumeric character.
- `a*` matches any string of zero or more `a`'s (including the empty string).
- `a?` matches any string of zero or one `a`'s (including the empty string).
- `a{2,5}` matches any string consisting of two to five `a`'s.
- `(a)` matches an `a`.
- Note that in POSIX regular expressions the operator `|` (rather than `+`) is used to denote union. In POSIX regular expressions `+` means "one or more instances of".
- `\` is a metacharacter that turns off any special meaning of the following character. For example, `d\*g` matches the string `d*g`. Another example: `\\` matches the string consisting of the single character `\`.
- The Unix command `egrep 'regexp' file` prints all lines in `file` that contain a substring matched by the regular expression `regexp`. Examples:
  - The command `egrep 'dog' file` would print all lines in `file` containing the substring `dog`.
  - The command `egrep '^a?b?c?d?e?$' file` would print all lines in `file` consisting of the letters `a, b, c, d, e` in increasing alphabetic order. The metacharacters `^` and `$` match the empty string at the beginning and end of a line, respectively.
  - `aegilops` is the longest English word whose letters are in increasing alphabetic order.
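
The `egrep` examples above can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not part of the original notes. Python's `re` module uses a Perl-style notation rather than POSIX, but the operators used here (`?`, `^`, `$`, `[...]`, `{m,n}`, `\`) behave the same way on these examples. The helper `grep_lines` and the word lists are hypothetical and chosen only for illustration.

```python
import re


def grep_lines(pattern, lines):
    """Return the lines that contain a substring matched by `pattern`,
    roughly what `egrep 'pattern' file` prints."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


if __name__ == "__main__":
    words = ["dog", "hotdog", "cat", "abe", "bad", "ace", "", "abcde", "aab"]

    # Lines containing the substring dog
    print(grep_lines(r"dog", words))
    # Lines made of the letters a..e in increasing alphabetic order;
    # the empty line also qualifies because every letter is optional.
    print(grep_lines(r"^a?b?c?d?e?$", words))

    # a?b?c? should accept exactly the eight strings listed in the notes
    candidates = ["", "a", "b", "c", "ab", "ac", "bc", "abc", "ba", "aa", "abcc"]
    print([w for w in candidates if re.fullmatch(r"a?b?c?", w)])

    # Backslash turns off the special meaning of the following character
    print(re.fullmatch(r"d\*g", "d*g") is not None)
    # A bracket expression with ^ excludes the listed characters
    print(grep_lines(r"[^0-9]", ["123", "12a"]))
```

`re.search` looks for a match anywhere in the line, which mirrors how `egrep` selects lines, while `re.fullmatch` asks whether the entire string belongs to the language.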

- Do the two regular expressions (a+b)* and (a*b*)* denote the same language?
- Write a Kleene regular expression for all strings of a's and b's with an even number of a's.
- Write a Kleene regular expression for all strings of a's and b's that begin and end with an a.
- Write a POSIX regular expression that matches all English words ending in dous.
- Write a POSIX regular expression that matches all English words with the five vowels a, e, i, o, u in order. (The vowels do not have to be next to one another.)

- HMU: Sects. 3.1, 3.3.1
- http://en.wikipedia.org/wiki/Regular_expression

aho@cs.columbia.edu