There are three important parts to a regular expression:
Anchors are used to specify the position of the pattern in relation to a line of text.
Character sets match one or more characters in a single position.
Modifiers specify how many times the previous character set is repeated.
A simple example that demonstrates all three parts is the regular expression:
^#*
The caret (^) is an anchor that indicates the beginning of the line.
The hash mark (#) is a simple character set that matches the single character #.
The asterisk (*) is a modifier.
In a regular expression it specifies that the previous character set
can appear any number of times, including zero.
As you will see shortly, this is a useless regular expression
(except for demonstrating the syntax!).
There are two main types of regular expressions: simple regular expressions and extended regular expressions. (As we'll see later in the article, the boundaries between the two types have become blurred as regular expressions have evolved.) A few utilities like awk and egrep use the extended regular expression. Most use the simple regular expression. From now on, if I talk about a "regular expression" (without specifying simple or extended), I am describing a feature common to both types.
The commands that understand just simple regular expressions are: vi, sed, grep, csplit, dbx, more, ed, expr, lex, and pg. The utilities awk, nawk, and egrep understand extended regular expressions.
[The situation is complicated by the fact that simple regular expressions have evolved over time, and so there are versions of "simple regular expressions" that support extensions missing from extended regular expressions! Bruce explains the incompatibility at the end of his article. -TOR ]
Most UNIX text facilities are line-oriented. Searching for patterns
that span several lines is not easy to do.
You see, the end-of-line character is not included in the block of
text that is searched.
It is a separator.
Regular expressions examine the text between the separators.
If you want to search for a pattern that is at one end or the other,
you use
anchors.
The caret (^
)
is the starting anchor, and the
dollar sign ($
)
is the end anchor.
The regular expression ^A
will match all lines that start with an uppercase A.
The expression
A$
will match all lines that end with uppercase A.
If the anchor characters are not used at the proper end of the
pattern, then they no longer act as anchors.
That is, the
^
is only an anchor if it is the first character in a regular
expression.
The
$
is only an anchor if it is the last character.
The expression
$1
does not have an anchor.
Neither does
1^
.
If you need to match a
^
at the beginning of the line or a
$
at the end of a line, you must
escape
the special character by typing a backslash (\
) before it.
Table 26.1
has a summary.
Pattern | Matches |
---|---|
^A | An A at the beginning of a line |
A$ | An A at the end of a line |
A | An A anywhere on a line |
$A | A $A anywhere on a line |
^\^ | A ^ at the beginning of a line |
^^ | Same as ^\^ |
\$$ | A $ at the end of a line |
$$ | Same as \$$ |
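A few rows of the table can be tried out with grep. This is only a sketch, assuming a POSIX shell and grep; the sample lines are made up for illustration and do not come from any real file.

```shell
# Hypothetical sample lines (invented for this example).
lines='Apple pie
banana A
An Area
cost: 5$
^caret first'

# ^A : lines that start with an uppercase A.
starts_with_A=$(printf '%s\n' "$lines" | grep -c '^A')

# A$ : lines that end with an uppercase A.
ends_with_A=$(printf '%s\n' "$lines" | grep -c 'A$')

# \$$ : an escaped (literal) $ followed by the end-of-line anchor.
ends_with_dollar=$(printf '%s\n' "$lines" | grep '\$$')

# ^\^ : the beginning-of-line anchor followed by an escaped (literal) ^.
starts_with_caret=$(printf '%s\n' "$lines" | grep '^\^')

echo "start with A: $starts_with_A, end with A: $ends_with_A"
echo "$ends_with_dollar"
echo "$starts_with_caret"
```

Note the single quotes around each pattern: they keep the shell from interpreting $ and \ before grep sees them.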
The use of
^
and
$
as indicators of the beginning or end of a line is a convention
other utilities use.
The
vi
editor uses these two characters as commands to go to the beginning or
end of a line.
The C shell uses
!^
to specify the first argument of the previous line, and
!$
is the last argument on the previous line
(article
11.7
explains).
It is one of those choices that other utilities go along with to
maintain consistency.
For instance,
$
can refer to the last line of a file when using
ed
and
sed.
cat -v -e (25.6, 25.7)
marks ends of lines with a
$
.
You might see it in other programs as well.
The simplest character set is a character.
The regular expression
the
contains three character sets:
t
,
h
,
and
e
.
It will match any line that contains the string
the
,
including the word
other
.
To prevent this, put a space before and after the pattern: " the " (the quotation marks only make the spaces visible and are not part of the pattern).
You can combine the string with an anchor.
The pattern
^From:
will match the lines of a
mail message (1.33)
that identify the sender.
Use this pattern with grep to print every address in your incoming mailbox:
% grep '^From: ' /usr/spool/mail/$USER
Some characters have a special meaning in regular expressions.
If you want to search for such a character as itself, escape it with a
backslash (\
).
The dot (.
)
is one of those special metacharacters.
By itself it will match any character, except the end-of-line
character.
The pattern that will match a line consisting of exactly one character is:
^.$
.
If you want to match specific characters, you can use
square brackets, []
, to identify the exact characters you are searching for.
The pattern that will match any line of text that consists of exactly one
digit is:
^[0123456789]$
.
This is longer than it has to be.
You can use the hyphen between two characters to specify a range:
^[0-9]$
.
You can intermix explicit characters with character ranges.
This pattern will match a single character that is a letter, digit,
or underscore:
[A-Za-z0-9_]
.
Character sets can be combined by placing them next to one another.
If you wanted to search for a word that:
started with an uppercase T,
was the first word on a line,
the second letter was a lowercase letter,
was three letters long (followed by a space character), and
the third letter was a lowercase vowel,
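the regular expression would be:
"^T[a-z][aeiou] "
(with a trailing space; the quotation marks only make the space visible and are not part of the pattern).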
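Here is a quick check of that pattern, as a sketch that assumes a POSIX shell and grep. The sample lines are invented, and LC_ALL=C is set so that the range [a-z] means plain ASCII lowercase letters regardless of your locale.

```shell
# Hypothetical sample lines (invented for this example).
lines='The cat
Tea time
Tho it rains
TVa show
Then what'

# T, then any lowercase letter, then a lowercase vowel, then a space,
# all anchored to the start of the line.
matched=$(printf '%s\n' "$lines" | LC_ALL=C grep -c '^T[a-z][aeiou] ')

echo "$matched"
```

"TVa show" fails because the second letter is uppercase, and "Then what" fails because the fourth character is not a space.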
[To be specific:
A range is a contiguous series of characters, from low to high, in the
ASCII chart (51.3).
For example, [z-a]
is not a range because it's backwards.
The range [A-z]
does match both uppercase and lowercase letters,
but it also matches the six characters that fall between uppercase
and lowercase letters in the ASCII chart:
[
, \
, ]
, ^
, _
, and `
.
-JP ]
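The effect of the over-wide range [A-z] can be seen with a small test. This sketch assumes a POSIX shell and grep, and it forces the C locale (LC_ALL=C) because range ordering outside that locale may differ from the ASCII chart.

```shell
# Four one-character lines: underscore, caret, a digit, a lowercase letter.
chars='_
^
5
q'

# [A-z] also covers the punctuation that sits between Z and a in ASCII.
wide=$(printf '%s\n' "$chars" | LC_ALL=C grep -c '^[A-z]$')

# [A-Za-z] covers letters only.
letters=$(printf '%s\n' "$chars" | LC_ALL=C grep -c '^[A-Za-z]$')

echo "[A-z] matched $wide lines, [A-Za-z] matched $letters"
```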
You can easily search for all characters except those in square
brackets by putting a
caret (^
)
as the first character after the
left square bracket ([
).
To match all characters except lowercase vowels use:
[^aeiou]
.
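Like anchors that appear where they cannot act as anchors, the
right square bracket (]) and dash (-) do not have a special meaning
if they directly follow a [ (or the ^ that negates the set).
A dash is also an ordinary character when it is the last character before the closing ].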
Table 26.2
has some examples.
Regular Expression | Matches |
---|---|
[0-9] | Any digit |
[^0-9] | Any character other than a digit |
[-0-9] | Any digit or a - |
[0-9-] | Any digit or a - |
[^-0-9] | Any character except a digit or a - |
[]0-9] | Any digit or a ] |
[0-9]] | Any digit followed by a ] |
[0-99-z] | Any digit or any character between 9 and z (51.3) |
[]0-9-] | Any digit, a - , or a ] |
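Some of the trickier rows can be verified with grep. This is a sketch under the assumption of a POSIX shell and grep; the one-character test lines are invented.

```shell
# Hypothetical one-character lines: a digit, a dash, a bracket, a letter.
chars='7
-
]
x'

# []0-9-] : a leading ] and a trailing - are both ordinary characters.
digit_dash_bracket=$(printf '%s\n' "$chars" | LC_ALL=C grep -c '^[]0-9-]$')

# [^-0-9] : anything except a dash or a digit.
not_digit_or_dash=$(printf '%s\n' "$chars" | LC_ALL=C grep -c '^[^-0-9]$')

# [0-9]] : the set [0-9] followed by a literal ].
digit_then_bracket=$(printf '%s\n' 'a[3]' | LC_ALL=C grep -c '[0-9]]')

echo "$digit_dash_bracket $not_digit_or_dash $digit_then_bracket"
```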
The third part of a regular expression is the modifier.
It is used to specify how many times you expect to see the previous
character set. The special character *
(asterisk)
matches
zero or more
copies.
That is, the regular expression
0*
matches
zero or more zeros,
while the expression
[0-9]*
matches zero or more digits.
This explains why the pattern
^#*
is useless, as it matches any number of
#
's
at the beginning of the line, including
zero.
Therefore, this will match every line, because every line starts with
zero or more
#
's.
At first glance, it might seem that starting the count at zero is
stupid.
Not so.
Looking for an unknown number of characters is very important.
Suppose you wanted to look for a digit at the beginning of a line,
and there may or may not be spaces before the digit.
Just use "^ *" (a caret, a space, and an asterisk; the quotation marks are not part of the pattern)
to match zero or more spaces at the beginning of the line.
If you need to match one or more, just repeat the character set.
That is,
[0-9]*
matches zero or more digits and
[0-9][0-9]*
matches one or more digits.
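The difference between "zero or more" and "one or more" shows up clearly when you count matching lines. A minimal sketch, assuming a POSIX shell and grep, with made-up sample lines:

```shell
# Hypothetical sample lines (invented for this example).
lines='42 apples
   7 pears
no digits
x9'

# Optional leading spaces, then a digit at the start of the line.
leading_digit=$(printf '%s\n' "$lines" | LC_ALL=C grep -c '^ *[0-9]')

# One or more digits anywhere: repeat the set, then add the star.
one_or_more=$(printf '%s\n' "$lines" | LC_ALL=C grep -c '[0-9][0-9]*')

# Zero or more digits: succeeds on every line, even one with no digits.
zero_or_more=$(printf '%s\n' "$lines" | LC_ALL=C grep -c '[0-9]*')

echo "$leading_digit $one_or_more $zero_or_more"
```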
You cannot specify a maximum number of sets
with the
*
modifier.
However,
some programs (26.9)
recognize a
special pattern you can use to specify the
minimum and maximum number of repeats.
This is done by putting those two numbers between
\{
and
\}
.
Having convinced you that
\{
isn't a plot to confuse you, an example is in order. The regular
expression to match four, five, six, seven, or eight lowercase letters is:
[a-z]\{4,8\}
.
Any numbers between 0 and 255 can be used.
The second number may be omitted, which removes the upper limit.
If the comma and the second number are omitted, the pattern must be
duplicated the exact number of times specified by the first number.
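The three forms of the repeat count can be compared side by side. This sketch assumes a grep that supports \{ \} (POSIX grep does); the words are invented. I have added ^ and $ so the count applies to the whole line, because without anchors a nine-letter word still contains a run of four to eight letters.

```shell
# Hypothetical words of 3, 4, 8, and 9 letters.
words='abc
abcd
abcdefgh
abcdefghi'

# Between four and eight lowercase letters.
four_to_eight=$(printf '%s\n' "$words" | LC_ALL=C grep -c '^[a-z]\{4,8\}$')

# Four or more: the second number is omitted.
four_or_more=$(printf '%s\n' "$words" | LC_ALL=C grep -c '^[a-z]\{4,\}$')

# Exactly four: the comma and the second number are omitted.
exactly_four=$(printf '%s\n' "$words" | LC_ALL=C grep -c '^[a-z]\{4\}$')

echo "$four_to_eight $four_or_more $exactly_four"
```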
CAUTION: The backslashes deserve a special discussion. Normally a backslash turns off the special meaning for a character. For example, a literal period is matched by \. and a literal asterisk is matched by \*. However, if a backslash is placed before a <, >, {, }, (, or ), or before a digit, the backslash turns on a special meaning. This was done because these special functions were added late in the life of regular expressions. Changing the meaning of {, }, (, ), <, and > would have broken old expressions. (This is a horrible crime punishable by a year of hard labor writing COBOL programs.) Instead, adding a backslash added functionality without breaking old programs. Rather than complain about the change, view it as evolution.
You must remember that modifiers like
*
and
\{1,5\}
only act as modifiers if they follow a character set.
If they were at the beginning of a pattern, they would not be modifiers.
Table 26.3
is a list of examples, and the exceptions.
Regular Expression | Matches |
---|---|
* | Any line with a * |
\* | Any line with a * |
\\ | Any line with a \ |
^* | Any line starting with a * |
^A* | Any line |
^A\* | Any line starting with an A* |
^AA* | Any line starting with at least one A |
^AA*B | Any line starting with one or more A's followed by a B |
^A\{4,8\}B | Any line starting with four, five, six, seven, or eight A's followed by a B |
^A\{4,\}B | Any line starting with four or more A's followed by a B |
^A\{4\}B | Any line starting with an AAAAB |
\{4,8\} | Any line with a {4,8} |
A{4,8} | Any line with an A{4,8} |
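A few of the exceptions in the table are easy to confirm. This is a sketch assuming a POSIX shell and a grep that uses simple (basic) regular expressions by default; the sample lines are invented.

```shell
# Hypothetical sample lines (invented for this example).
lines='AAB
B
A{4,8}
*star'

# ^A* : zero or more A's at the start, so every line matches.
any_line=$(printf '%s\n' "$lines" | grep -c '^A*')

# ^AA*B : one or more A's followed by a B, at the start of the line.
a_then_b=$(printf '%s\n' "$lines" | grep -c '^AA*B')

# A{4,8} : without backslashes the braces are ordinary characters.
literal_braces=$(printf '%s\n' "$lines" | grep 'A{4,8}')

# ^* : a star with no character set before it is an ordinary character.
leading_star=$(printf '%s\n' "$lines" | grep '^*')

echo "$any_line $a_then_b"
echo "$literal_braces"
echo "$leading_star"
```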
Searching for a word isn't quite as simple as it at first appears.
The string
the
will match the word
other
.
You can put spaces before and after the letters and use this regular
expression: " the " (the quotation marks are not part of the pattern).
However, this does not match words at the beginning or the end of the line.
And it does not match the case where there is a punctuation mark
after the word.
There is an easy solution - at least in many versions of ed, ex, and
vi.
The characters
\<
and
\>
are similar to the ^ and $ anchors,
in that they don't occupy a character position.
Instead, they anchor the expression between them so that it matches only at a word boundary.
The pattern to search for the words
the
and The
would be:
\<[tT]he\>
.
Let's define a "word boundary."
The character before the
t
or T
must be either a newline character or anything except a letter,
digit, or underscore ( _
).
The character after the
e
must
also be a character other than a digit, letter, or underscore,
or it could be the end-of-line character.
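The word-boundary definition can be checked against a few sample lines. This sketch assumes a grep that accepts \< and \> (GNU grep does; it is an extension, so your grep may not), and the sample lines are made up.

```shell
# Hypothetical sample lines (invented for this example).
lines='the end
other
The.
bathe
see the
the_end'

# Match "the" or "The" only when it stands alone as a word.
whole_words=$(printf '%s\n' "$lines" | LC_ALL=C grep -c '\<[tT]he\>')

echo "$whole_words"
```

"other" and "bathe" are rejected because a letter touches the pattern, and "the_end" is rejected because the underscore counts as part of a word.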
Another pattern that requires a special mechanism is searching for
repeated words.
The expression
[a-z][a-z]
will match any two lowercase letters.
If you wanted to search for lines that had two adjoining identical
letters, the above pattern wouldn't help.
You need a way to remember what you found and see if
the same pattern occurs again.
In some programs,
you can mark part of a pattern using
\(
and
\)
.
You can recall the remembered pattern with
\
followed by a single digit.
Therefore, to search for two identical letters, use:
\([a-z]\)\1
.
You can have nine different remembered patterns.
Each occurrence of
\(
starts a new pattern.
The regular expression to match a five-letter palindrome
(e.g., "radar") is:
\([a-z]\)\([a-z]\)[a-z]\2\1
.
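Both remembered-pattern examples can be run through grep. A minimal sketch, assuming a POSIX shell and grep with made-up words; I have anchored the palindrome pattern with ^ and $ so that it tests the whole line rather than any five letters inside it.

```shell
# Hypothetical words (invented for this example).
words='book
radar
level
abcde
cat'

# \([a-z]\)\1 : a lowercase letter followed by the same letter again.
doubled=$(printf '%s\n' "$words" | LC_ALL=C grep '\([a-z]\)\1')

# Five-letter palindrome: letters 1 and 2 are remembered, then
# recalled in reverse order after the middle letter.
palindromes=$(printf '%s\n' "$words" |
    LC_ALL=C grep -c '^\([a-z]\)\([a-z]\)[a-z]\2\1$')

echo "$doubled"
echo "$palindromes"
```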
[Some versions of some programs can't handle \( \) in the same
regular expression as \1, etc.
In all versions of sed, you're safe if you use \( \)
on the pattern side of an s command, and \1, etc., on the replacement side (34.10).
-JP ]
That completes a discussion of simple regular expressions. Before I discuss the extensions that extended expressions offer, I want to mention two potential problem areas.
The \< and \> characters were introduced in the vi
editor. The other programs didn't have this ability at that time.
Also, the \{min,max\}
modifier is new, and earlier utilities didn't have this ability.
This makes it difficult for the novice user of regular expressions,
because it seems as if each utility has a different convention.
Sun has retrofitted the newest regular expression library to all of
their programs, so they all have the same ability.
If you try to use these newer features on other vendors' machines, you
might find they don't work the same way.
The other potential point of confusion is the
extent of the pattern matches (26.6).
Regular expressions match the longest possible pattern.
That is, the regular expression
A.*B
matches
AAB
as well as
AAAABBBBABCCCCBBBAAAB
.
This doesn't cause many problems using
grep,
because an oversight in a regular expression will just match more
lines than desired.
If you use
sed,
and your patterns get carried away, you may end up deleting or
changing more than you want to.
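To see how a greedy match can change more than you intended, compare these sed substitutions. This is only a sketch, assuming a POSIX shell and sed; the quoted-text example is my own and not from the article.

```shell
# The whole string is replaced: the match runs from the first A to the last B.
greedy=$(printf '%s\n' 'AAAABBBBABCCCCBBBAAAB' | sed 's/A.*B/X/')

# Hypothetical input with two quoted words.
text='say "one" and "two"'

# ".*" runs from the first quote to the LAST quote on the line.
too_much=$(printf '%s\n' "$text" | sed 's/".*"/Q/')

# "[^"]*" cannot cross a quote, so only the first quoted word is replaced.
just_first=$(printf '%s\n' "$text" | sed 's/"[^"]*"/Q/')

echo "$greedy"
echo "$too_much"
echo "$just_first"
```

Replacing . with a negated character set, as in the last command, is one common way to keep a pattern from running too far.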
Two programs use extended regular expressions:
egrep
and
awk.
[perl uses expressions that are even more extended. -JP ]
With these extensions, those special characters preceded by a backslash
no longer have special meaning:
\{, \}, \<, \>, \(, \), as well as \digit.
There is a very good reason for this, which I will
delay explaining to build up suspense.
The
question mark (?
)
matches zero or one instances of the character set before it, and the
plus sign (+
)
matches one or more copies of the character set.
You can't use \{ and \} in extended regular expressions,
but if you could, you might consider ? to be the same as \{0,1\}
and + to be the same as \{1,\}.
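The equivalence can be demonstrated by running the extended forms next to their simple counterparts. This sketch assumes a POSIX grep, where grep -E is the standard spelling of egrep; the sample words are invented.

```shell
# Hypothetical sample words (invented for this example).
words='color
colour
colouur'

# Extended: ? means zero or one u, + means one or more u's.
optional_u=$(printf '%s\n' "$words" | grep -E -c '^colou?r$')
some_u=$(printf '%s\n' "$words" | grep -E -c '^colou+r$')

# Simple regular expressions: the same counts written with \{ \}.
optional_u_simple=$(printf '%s\n' "$words" | grep -c '^colou\{0,1\}r$')
some_u_simple=$(printf '%s\n' "$words" | grep -c '^colou\{1,\}r$')

echo "$optional_u $some_u $optional_u_simple $some_u_simple"
```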
By now, you are wondering why the extended regular expressions are even worth using. Except for two abbreviations, there seem to be no advantages and a lot of disadvantages. Therefore, examples would be useful.
The three important characters in the extended regular expressions are
(, |, and ).
Parentheses are used to group expressions; the vertical bar acts as
an OR operator.
Together, they let you match a
choice
of patterns.
As an example, you can
use egrep
to print all
From:
and
Subject:
lines from your incoming mail:
% egrep '^(From|Subject): ' /usr/spool/mail/$USER
All lines starting with
From:
or
Subject:
will be printed. There is no easy way to do this with simple
regular expressions. You could try something like
^[FS][ru][ob][mj]e*c*t*:
and hope you don't have any lines that start with
Sromeet:
.
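The difference between the two approaches is easy to see on a small made-up mailbox. This sketch assumes a POSIX shell and uses grep -E, the standard spelling of egrep; the header lines, including the addresses, are invented.

```shell
# Hypothetical message headers (invented for this example).
mail='From: alice@example.com
To: bob@example.com
Subject: lunch
Sromeet: oops'

# Extended regular expression: exactly the two headers we want.
exact=$(printf '%s\n' "$mail" | grep -E -c '^(From|Subject): ')

# Simple regular expression approximation: also lets Sromeet: through.
sloppy=$(printf '%s\n' "$mail" |
    LC_ALL=C grep -c '^[FS][ru][ob][mj]e*c*t*:')

echo "$exact $sloppy"
```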
Extended expressions don't have the \< and \> characters.
You can compensate by using the alternation mechanism.
Matching the word
"the"
in the beginning, middle, or end of a sentence or at the end of a line can be
done with the extended regular expression:
(^| )the([^a-z]|$)
.
There are two choices before the word: a space or the beginning of a
line.
Following the word, there must be something besides a lowercase letter or
else the end of the line.
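Here is that expression at work, as a sketch assuming a POSIX shell and grep -E (the standard spelling of egrep), with invented sample lines.

```shell
# Hypothetical sample lines (invented for this example).
lines='the start
at the end: the
other
bathe
see the.
then'

# Before "the": start of line or a space.
# After "the": anything but a lowercase letter, or the end of the line.
found=$(printf '%s\n' "$lines" | LC_ALL=C grep -E -c '(^| )the([^a-z]|$)')

echo "$found"
```

Note that this pattern only looks for a space in front of the word, so it would miss "the" directly after a quotation mark or an opening parenthesis.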
One extra bonus with extended regular expressions is the ability to
use the
*
,
+
,
and
?
modifiers after a
(...)
grouping.
Here are two ways to match
"a simple problem",
"an easy problem",
as well as
"a problem";
the second expression is more exact:
% egrep "a[n]? (simple|easy)? ?problem" data
% egrep "a[n]? ((simple|easy) )?problem" data
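To see why the second expression is more exact, feed both some malformed phrases. This is a sketch assuming a POSIX shell and grep -E (the standard spelling of egrep); the phrases stand in for the contents of a data file and are invented.

```shell
# Hypothetical phrases: three good ones, then a missing space
# and a doubled space.
phrases='a problem
an easy problem
a simple problem
a simpleproblem
a  problem'

# First form: the adjective and the space after it are optional
# independently of each other.
loose=$(printf '%s\n' "$phrases" |
    grep -E -c 'a[n]? (simple|easy)? ?problem')

# Second form: the adjective and its trailing space are optional as a unit.
strict=$(printf '%s\n' "$phrases" |
    grep -E -c 'a[n]? ((simple|easy) )?problem')

echo "$loose $strict"
```

The first form accepts all five lines, including the two malformed ones; the second accepts only the three well-formed phrases.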
I promised to explain why the backslash characters don't work in
extended regular expressions.
Well, perhaps the \{...\} and \<...\>
could be added to the extended expressions, but
it might confuse people if those characters are added and the \(...\)
are not. And there is no way to add that functionality to the extended
expressions without changing the current usage. Do you see why?
It's quite simple. If ( has a special meaning, then \(
must be the ordinary character.
This is the opposite of the simple regular expressions,
where ( is ordinary and \( is special.
The usage of the parentheses is incompatible, and any change could
break old programs.
If the extended expression used (...|...) as regular characters, and
\(...\|...\) for specifying alternate patterns, then it is possible to have one set
of regular expressions that has full functionality.
This is exactly what GNU Emacs (32.1) does, by the way; it combines
all of the features of regular and
extended expressions with one syntax.