26.4 Using Metacharacters in Regular Expressions

There are three important parts to a regular expression:

Anchors are used to specify the position of the pattern in relation to a line of text.
Character sets match one or more characters in a single position.
Modifiers specify how many times the previous character set is repeated.

A simple example that demonstrates all three parts is the regular expression:

^#*

The caret (^) is an anchor that indicates the beginning of the line. The hash mark is a simple character set that matches the single character #. The asterisk (*) is a modifier. In a regular expression it specifies that the previous character set can appear any number of times, including zero. As you will see shortly, this is a useless regular expression (except for demonstrating the syntax!).

There are two main types of regular expressions: simple regular expressions and extended regular expressions. (As we'll see later in the article, the boundaries between the two types have become blurred as regular expressions have evolved.) A few utilities like awk and egrep use the extended regular expression. Most use the simple regular expression. From now on, if I talk about a "regular expression" (without specifying simple or extended), I am describing a feature common to both types.

The commands that understand just simple regular expressions are: vi, sed, grep, csplit, dbx, more, ed, expr, lex, and pg. The utilities awk, nawk, and egrep understand extended regular expressions.

[The situation is complicated by the fact that simple regular expressions have evolved over time, and so there are versions of "simple regular expressions" that support extensions missing from extended regular expressions! Bruce explains the incompatibility at the end of his article. -TOR ]

26.4.1 The Anchor Characters: ^ and $

Most UNIX text facilities are line-oriented. Searching for patterns that span several lines is not easy to do. You see, the end-of-line character is not included in the block of text that is searched. It is a separator. Regular expressions examine the text between the separators. If you want to search for a pattern that is at one end or the other, you use anchors. The caret (^) is the starting anchor, and the dollar sign ($) is the end anchor. The regular expression ^A will match all lines that start with an uppercase A. The expression A$ will match all lines that end with uppercase A. If the anchor characters are not used at the proper end of the pattern, then they no longer act as anchors. That is, the ^ is only an anchor if it is the first character in a regular expression. The $ is only an anchor if it is the last character. The expression $1 does not have an anchor. Neither does 1^. If you need to match a ^ at the beginning of the line or a $ at the end of a line, you must escape the special character by typing a backslash (\) before it. Table 26.1 has a summary.

Table 26.1: Regular Expression Anchor Character Examples
Pattern	Matches
`^A`	An A at the beginning of a line
`A$`	An A at the end of a line
`A`	An A anywhere on a line
`$A`	A `$A` anywhere on a line
`^\^`	A `^` at the beginning of a line
`^^`	Same as `^\^`
`\$$`	A `$` at the end of a line
`$$`	Same as `\$$`
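
The anchor rules in Table 26.1 can be checked with a short experiment. This is only a sketch: the sample lines and variable names are made up for illustration, and it assumes a POSIX-style shell and a grep whose -c option prints a count of matching lines.

```shell
# Sample text invented for this illustration.
sample='Apple
banana A
A
no match here'

# Count the lines that start with an uppercase A (Apple, A).
starts=$(printf '%s\n' "$sample" | grep -c '^A')

# Count the lines that end with an uppercase A (banana A, A).
ends=$(printf '%s\n' "$sample" | grep -c 'A$')

# A $ that is not the last character is ordinary: this looks for a literal $A.
literal=$(printf '%s\n' 'cost: $A' 'A$' | grep -c '$A')

echo "start: $starts  end: $ends  literal: $literal"
```

The single quotes matter: they keep the shell from treating $A as a variable before grep ever sees the pattern.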

The use of ^ and $ as indicators of the beginning or end of a line is a convention other utilities use. The vi editor uses these two characters as commands to go to the beginning or end of a line. The C shell uses !^ to specify the first argument of the previous line, and !$ is the last argument on the previous line (article 11.7 explains).

It is one of those choices that other utilities go along with to maintain consistency. For instance, $ can refer to the last line of a file when using ed and sed. cat -v -e (25.6, 25.7) marks ends of lines with a $. You might see it in other programs as well.

26.4.2 Matching a Character with a Character Set

The simplest character set is a single character. The regular expression the contains three character sets: t, h, and e. It will match any line that contains the string the, including a line where the string appears only inside a word such as other. To prevent this, put a space before and after the pattern: ` the `. You can combine the string with an anchor. The pattern ^From: will match the lines of a mail message (1.33) that identify the sender. Use this pattern with grep to print every address in your incoming mailbox:

% grep '^From: ' /usr/spool/mail/$USER

Some characters have a special meaning in regular expressions. If you want to search for such a character as itself, escape it with a backslash (\).

26.4.3 Match any Character with . (Dot)

The dot (.) is one of those special metacharacters. By itself it will match any character, except the end-of-line character. The pattern that will match a line consisting of exactly one character is: ^.$.
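
A quick way to see ^.$ at work, as a sketch with invented input: of the four lines fed to grep below, one is empty and one has two characters, so only the two one-character lines should be counted.

```shell
# Four lines: "a", an empty line, "ab", and "7".
single=$(printf 'a\n\nab\n7\n' | grep -c '^.$')

echo "one-character lines: $single"
```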

26.4.4 Specifying a Range of Characters with [...]

If you want to match specific characters, you can use square brackets, [], to identify the exact characters you are searching for. The pattern that will match any line of text that consists of exactly one digit is: ^[0123456789]$. This is longer than it has to be. You can use the hyphen between two characters to specify a range: ^[0-9]$. You can intermix explicit characters with character ranges. This pattern will match a single character that is a letter, digit, or underscore: [A-Za-z0-9_]. Character sets can be combined by placing them next to one another. If you wanted to search for a word that:

started with an uppercase T,
was the first word on a line,
the second letter was a lowercase letter,
was three letters long (followed by a space character), and
the third letter was a lowercase vowel,

the regular expression would be: `^T[a-z][aeiou] ` (note the space at the end of the pattern).

[To be specific: A range is a contiguous series of characters, from low to high, in the ASCII chart (51.3). For example, [z-a] is not a range because it's backwards. The range [A-z] does match both uppercase and lowercase letters, but it also matches the six characters that fall between uppercase and lowercase letters in the ASCII chart: [, \, ], ^, _, and `. -JP ]
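
Here is a sketch that tries the three-letter-word pattern and the [A-z] range on invented input. Setting LC_ALL=C is my assumption, not something the article requires: it ties ranges to the ASCII chart, which some locales do not.

```shell
# Sample lines invented for this illustration.
lines='The cat
Try this
Then what
the end
Too late'

# T, any lowercase letter, a lowercase vowel, then a space.
matches=$(printf '%s\n' "$lines" | LC_ALL=C grep '^T[a-z][aeiou] ')
printf '%s\n' "$matches"

# The underscore falls between Z and a in ASCII, so [A-z] accepts it.
underscore=$(printf '_\n' | LC_ALL=C grep -c '[A-z]')
echo "underscore matched by [A-z]: $underscore"
```

"Try this" fails because y is not in [aeiou], "Then what" because the fourth character is not a space, and "the end" because of the lowercase t.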

26.4.5 Exceptions in a Character Set

You can easily search for all characters except those in square brackets by putting a caret (^) as the first character after the left square bracket ([). To match all characters except lowercase vowels use: [^aeiou].

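Just as the anchor characters lose their special meaning when they appear where they cannot act as anchors, the right square bracket (]) and the dash (-) have no special meaning if they directly follow the [ (or the ^ of a negated set). A dash is also ordinary when it is the last character before the closing ]. Table 26.2 has some examples.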

Table 26.2: Regular Expression Character Set Examples
Regular Expression	Matches
[0-9]	Any digit
[^0-9]	Any character other than a digit
[-0-9]	Any digit or a `-`
[0-9-]	Any digit or a `-`
[^-0-9]	Any character except a digit or a `-`
[]0-9]	Any digit or a `]`
[0-9]]	Any digit followed by a `]`
[0-99-z]	Any digit or any character between 9 and z (51.3)
[]0-9-]	Any digit, a `-`, or a `]`
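
A few rows of Table 26.2 can be tested with a small helper. This is a sketch: the count function and the sample lines are hypothetical names and data of my own, and LC_ALL=C is an assumption that keeps 0-9 tied to the ASCII chart.

```shell
# Sample lines invented for this illustration.
input='7
-
]
x
5]'

# count PATTERN: print how many sample lines contain a match.
count() {
    printf '%s\n' "$input" | LC_ALL=C grep -c -e "$1"
}

digit_or_dash=$(count '[-0-9]')        # 7, -, 5]
digit_or_bracket=$(count '[]0-9]')     # 7, ], 5]
digit_then_bracket=$(count '[0-9]]')   # 5]
neither=$(count '^[^-0-9]$')           # ], x

echo "$digit_or_dash $digit_or_bracket $digit_then_bracket $neither"
```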

26.4.6 Repeating Character Sets with *

The third part of a regular expression is the modifier. It is used to specify how many times you expect to see the previous character set. The special character * (asterisk) matches zero or more copies. That is, the regular expression 0* matches zero or more zeros, while the expression [0-9]* matches zero or more digits.

This explains why the pattern ^#* is useless, as it matches any number of #'s at the beginning of the line, including zero. Therefore, this will match every line, because every line starts with zero or more #'s.

At first glance, it might seem that starting the count at zero is stupid. Not so. Looking for an unknown number of characters is very important. Suppose you wanted to look for a digit at the beginning of a line, and there may or may not be spaces before the digit. Just use `^ *` (a caret, a space, and an asterisk) to match zero or more spaces at the beginning of the line, and follow it with [0-9]. If you need to match one or more, just repeat the character set. That is, [0-9]* matches zero or more digits and [0-9][0-9]* matches one or more digits.
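
The effect of "zero or more" is easy to see by counting matches. In this sketch the five sample lines are invented; the patterns are the ones discussed above.

```shell
# Sample lines invented for this illustration.
text='# comment
code
  42 indented
7 first
x 9'

# ^#* also matches zero #'s, so every one of the five lines matches.
every=$(printf '%s\n' "$text" | grep -c '^#*')

# Optional leading spaces, then a digit: "  42 indented" and "7 first".
leading=$(printf '%s\n' "$text" | grep -c '^ *[0-9]')

# One or more digits anywhere on the line.
any_digit=$(printf '%s\n' "$text" | grep -c '[0-9][0-9]*')

echo "every: $every  leading: $leading  any digit: $any_digit"
```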

26.4.7 Matching a Specific Number of Sets with \{ and \}

You cannot specify a maximum number of sets with the * modifier. However, some programs (26.9) recognize a special pattern you can use to specify the minimum and maximum number of repeats. This is done by putting those two numbers between \{ and \}.

Having convinced you that \{ isn't a plot to confuse you, an example is in order. The regular expression to match four, five, six, seven, or eight lowercase letters is: [a-z]\{4,8\}. Any numbers between 0 and 255 can be used. The second number may be omitted, which removes the upper limit. If the comma and the second number are omitted, the pattern must be duplicated the exact number of times specified by the first number.

CAUTION: The backslashes deserve a special discussion. Normally a backslash turns off the special meaning for a character. For example, a literal period is matched by \. and a literal asterisk is matched by \*. However, if a backslash is placed before a <, >, {, }, (, or ) or before a digit, the backslash turns on a special meaning. This was done because these special functions were added late in the life of regular expressions. Changing the meaning of {, }, (, ), <, and > would have broken old expressions. (This is a horrible crime punishable by a year of hard labor writing COBOL programs.) Instead, adding a backslash added functionality without breaking old programs. Rather than complain about the change, view it as evolution.

You must remember that modifiers like * and \{1,5\} only act as modifiers if they follow a character set. If they were at the beginning of a pattern, they would not be modifiers. Table 26.3 is a list of examples, and the exceptions.

Table 26.3: Regular Expression Pattern Repetition Examples
Regular Expression	Matches
*	Any line with a `*`
\*	Any line with a `*`
\\	Any line with a `\`
^*	Any line starting with a `*`
^A*	Any line
^A\*	Any line starting with `A*`
^AA*	Any line starting with one A
^AA*B	Any line starting with one or more A's followed by a B
^A\{4,8\}B	Any line starting with four, five, six, seven, or eight A's followed by a B
^A\{4,\}B	Any line starting with four or more A's followed by a B
^A\{4\}B	Any line starting with an AAAAB
\{4,8\}	Any line with a {4,8}
A{4,8}	Any line with an A{4,8}
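
Some of the rows in Table 26.3 can be tried out as follows. This is a sketch that assumes a grep that recognizes \{ and \} (not every old version did, as explained above); the sample lines hold three, four, eight, and nine A's before the B, plus one line containing a literal A{4,8}.

```shell
# Sample lines invented for this illustration.
runs='AAAB
AAAAB
AAAAAAAAB
AAAAAAAAAB
A{4,8}'

# Four to eight A's, then B: the 4-A and 8-A lines.
bounded=$(printf '%s\n' "$runs" | grep -c '^A\{4,8\}B')

# Four or more A's, then B: the 4-A, 8-A, and 9-A lines.
at_least=$(printf '%s\n' "$runs" | grep -c '^A\{4,\}B')

# Exactly four A's, then B.
exact=$(printf '%s\n' "$runs" | grep -c '^A\{4\}B')

# Without backslashes the braces are ordinary characters.
literal=$(printf '%s\n' "$runs" | grep -c 'A{4,8}')

echo "$bounded $at_least $exact $literal"
```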

26.4.8 Matching Words with \< and \>

Searching for a word isn't quite as simple as it at first appears. The string the will match the word other. You can put spaces before and after the letters and use this regular expression: ` the `. However, this does not match words at the beginning or the end of the line. And it does not match the case where there is a punctuation mark after the word.

There is an easy solution - at least in many versions of ed, ex, and vi. The characters \< and \> are similar to the ^ and $ anchors in that they don't occupy a character position. Instead, they anchor the expression between them so that it matches only on a word boundary. The pattern to search for the words the and The would be: \<[tT]he\>.

Let's define a "word boundary." The character before the t or T must be either a newline character or anything except a letter, digit, or underscore ( _ ). The character after the e must also be a character other than a digit, letter, or underscore, or it could be the end-of-line character.
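
The difference between the three approaches shows up when you count matches. This sketch assumes a grep that supports \< and \> (GNU grep does; the article only promises them for some versions of ed, ex, and vi), and the sample lines are invented.

```shell
# Sample lines invented for this illustration.
lines='the end
other
The, start
bathe
see the.
the_x
in the middle'

# The bare string also matches inside other, bathe, and the_x.
loose=$(printf '%s\n' "$lines" | grep -c 'the')

# Surrounding spaces miss the word at either end of a line or before punctuation.
spaced=$(printf '%s\n' "$lines" | grep -c ' the ')

# Word anchors accept the end, The, start, see the., and in the middle.
word=$(printf '%s\n' "$lines" | grep -c '\<[tT]he\>')

echo "loose: $loose  spaced: $spaced  word: $word"
```

The line the_x is rejected by the anchored pattern because the underscore counts as part of a word.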

26.4.9 Remembering Patterns with \(, \), and \1

Another pattern that requires a special mechanism is searching for repeated words. The expression [a-z][a-z] will match any two lowercase letters. If you wanted to search for lines that had two adjoining identical letters, the above pattern wouldn't help. You need a way to remember what you found and see if the same pattern occurs again. In some programs, you can mark part of a pattern using \( and \). You can recall the remembered pattern with \ followed by a single digit. Therefore, to search for two identical letters, use: \([a-z]\)\1. You can have nine different remembered patterns. Each occurrence of \( starts a new pattern. The regular expression to match a five-letter palindrome (e.g., "radar") is: \([a-z]\)\([a-z]\)[a-z]\2\1. [Some versions of some programs can't handle \( \) in the same regular expression as \1, etc. In all versions of sed, you're safe if you use \( \) on the pattern side of an s command, and \1, etc., on the replacement side (34.10). -JP ]
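
Remembered patterns are easier to believe once you have seen them work. The sketch below assumes a grep that accepts \1 and \2 inside the search pattern (see the warning above); the word list and the sed input are invented, and LC_ALL=C is my own precaution so that a-z means the ASCII lowercase letters.

```shell
# Sample words invented for this illustration.
words='book
radar
level
plain
abcba'

# \([a-z]\) remembers one lowercase letter; \1 demands the same letter again.
doubled=$(printf '%s\n' "$words" | LC_ALL=C grep -c '\([a-z]\)\1')

# Five-letter palindrome: letters 1 and 2 are remembered, then recalled in reverse.
palindromes=$(printf '%s\n' "$words" | LC_ALL=C grep -c '\([a-z]\)\([a-z]\)[a-z]\2\1')

# In sed, \( \) go on the pattern side and \1, \2 on the replacement side.
swapped=$(echo 'hello world' | LC_ALL=C sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')

echo "doubled: $doubled  palindromes: $palindromes  swapped: $swapped"
```

Only book has a doubled letter; radar, level, and abcba are the palindromes; and the sed command swaps the two words.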

26.4.10 Potential Problems

That completes a discussion of simple regular expressions. Before I discuss the extensions that extended expressions offer, I want to mention two potential problem areas.

The \< and \> characters were introduced in the vi editor. The other programs didn't have this ability at that time. Also, the \{min,max\} modifier is new, and earlier utilities didn't have this ability. This makes it difficult for the novice user of regular expressions, because it seems as if each utility has a different convention. Sun has retrofitted the newest regular expression library to all of their programs, so they all have the same ability. If you try to use these newer features on other vendors' machines, you might find they don't work the same way.

The other potential point of confusion is the extent of the pattern matches (26.6). Regular expressions match the longest possible pattern. That is, the regular expression A.*B matches AAB as well as AAAABBBBABCCCCBBBAAAB. This doesn't cause many problems using grep, because an oversight in a regular expression will just match more lines than desired. If you use sed, and your patterns get carried away, you may end up deleting or changing more than you want to.
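
To see how far a pattern can get carried away in sed, try the long string from the paragraph above. This is a sketch; the second command, which replaces .* with [^B]*, is one common way to rein the match in and is my addition, not something the article prescribes.

```shell
line='AAAABBBBABCCCCBBBAAAB'

# A.*B starts at the first A and runs to the last B: the whole line is replaced.
greedy=$(echo "$line" | sed 's/A.*B/X/')

# A[^B]*B cannot cross a B, so the match stops at the first B it meets.
narrow=$(echo "$line" | sed 's/A[^B]*B/X/')

echo "greedy: $greedy"
echo "narrow: $narrow"
```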

26.4.11 Extended Regular Expressions

Two programs use extended regular expressions: egrep and awk. [perl uses expressions that are even more extended. -JP ] With these extensions, those special characters preceded by a backslash no longer have special meaning: \{, \}, \<, \>, \(, \), as well as \digit. There is a very good reason for this, which I will delay explaining to build up suspense.

The question mark (?) matches zero or one instances of the character set before it, and the plus sign (+) matches one or more copies of the character set. You can't use \{ and \} in extended regular expressions, but if you could, you might consider ? to be the same as \{0,1\} and + to be the same as \{1,\}.

By now, you are wondering why the extended regular expressions are even worth using. Except for two abbreviations, there seem to be no advantages and a lot of disadvantages. Therefore, examples would be useful.

The three important characters in the extended regular expressions are (, |, and ). Parentheses are used to group expressions; the vertical bar acts as an OR operator. Together, they let you match a choice of patterns. As an example, you can use egrep to print all From: and Subject: lines from your incoming mail:

% egrep '^(From|Subject): ' /usr/spool/mail/$USER

All lines starting with From: or Subject: will be printed. There is no easy way to do this with simple regular expressions. You could try something like ^[FS][ru][ob][mj]e*c*t*: and hope you don't have any lines that start with Sromeet:.

Extended expressions don't have the \< and \> characters. You can compensate by using the alternation mechanism. Matching the word "the" in the beginning, middle, or end of a sentence or at the end of a line can be done with the extended regular expression: (^| )the([^a-z]|$). There are two choices before the word: a space or the beginning of a line. Following the word, there must be something besides a lowercase letter or else the end of the line.

One extra bonus with extended regular expressions is the ability to use the *, +, and ? modifiers after a (...) grouping. Here are two ways to match "a simple problem", "an easy problem", as well as "a problem"; the second expression is more exact:

% egrep "a[n]? (simple|easy)? ?problem" data
% egrep "a[n]? ((simple|easy) )?problem" data
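
The extended expressions in this section can be compared side by side. This sketch uses grep -E, which is how POSIX spells egrep on current systems; the sample mail headers and phrases are invented, and LC_ALL=C is my own precaution so that a-z means the ASCII lowercase letters.

```shell
# Sample header lines invented for this illustration.
mail='From: alice
Subject: hi
To: bob
Sromeet: oops'

# Alternation picks exactly the two headers we want.
ere=$(printf '%s\n' "$mail" | grep -E -c '^(From|Subject): ')

# The simple-expression workaround also lets Sromeet: through.
bre=$(printf '%s\n' "$mail" | grep -c '^[FS][ru][ob][mj]e*c*t*: ')

# The word "the" without \< and \>: the end and see the. match.
the_word=$(printf 'the end\nother\nsee the.\nbathe\n' |
    LC_ALL=C grep -E -c '(^| )the([^a-z]|$)')

# Sample phrases; the last one is missing a space.
phrases='a simple problem
an easy problem
a problem
a simpleproblem'

# The first egrep pattern accepts all four lines; the second rejects the last.
loose=$(printf '%s\n' "$phrases" | grep -E -c 'a[n]? (simple|easy)? ?problem')
exact=$(printf '%s\n' "$phrases" | grep -E -c 'a[n]? ((simple|easy) )?problem')

echo "ere: $ere  bre: $bre  the: $the_word  loose: $loose  exact: $exact"
```

In the second phrase pattern the space belongs to the optional group, so the adjective and its trailing space appear together or not at all. That is why it is the more exact of the two.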

I promised to explain why the backslash characters don't work in extended regular expressions. Well, perhaps the \{...\} and \<...\> could be added to the extended expressions, but it might confuse people if those characters are added and the \(...\) are not. And there is no way to add that functionality to the extended expressions without changing the current usage. Do you see why? It's quite simple. If ( has a special meaning, then \( must be the ordinary character. This is the opposite of the simple regular expressions, where ( is ordinary and \( is special. The usage of the parentheses is incompatible, and any change could break old programs.

If the extended expression used (...|...) as regular characters, and \(...\|...\) for specifying alternate patterns, then it is possible to have one set of regular expressions that has full functionality. This is exactly what GNU Emacs (32.1) does, by the way; it combines all of the features of regular and extended expressions with one syntax.

- BB

