As mentioned earlier, a bracket expression matches any character among those listed between the opening and closing square brackets.
Within a bracket expression, a range expression consists of two
characters separated by a hyphen. It matches any single character that
sorts between the two characters, based upon the system’s native character
set. For example, ‘[0-9]’ is equivalent to ‘[0123456789]’.
(See Regexp Ranges and Locales: A Long Sad Story for an explanation of how the POSIX standard and gawk have changed over time. This is mainly of historical interest.)
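As a minimal sketch of a range expression in use, the following shell function replaces every digit in its argument with ‘#’. The function name mask_digits is hypothetical and serves only for illustration; the example assumes an awk program is available on the search path.

```shell
# Replace every character matched by the range expression [0-9] with '#'.
# mask_digits is a hypothetical helper name, not part of awk.
mask_digits() {
    printf '%s\n' "$1" | awk '{ gsub(/[0-9]/, "#"); print }'
}

mask_digits 'room 42, floor 7'    # prints: room ##, floor #
```

Writing the list out in full as ‘[0123456789]’ should give the same result.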
With the increasing popularity of the Unicode character standard, there is an additional wrinkle to consider. Octal and hexadecimal escape sequences inside bracket expressions are taken to represent only single-byte characters (characters whose values fit within the range 0–255). To match a range of characters where the endpoints of the range are larger than 255, enter the multibyte encodings of the characters directly.
To include one of the characters ‘\’, ‘]’, ‘-’, or ‘^’ in a bracket expression, put a ‘\’ in front of it. For example:
[d\]]
matches either ‘d’ or ‘]’. Additionally, if you place ‘]’ right after the opening ‘[’, the closing bracket is treated as one of the characters to be matched.
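Both ways of including ‘]’ in a bracket expression can be tried from the shell. This is a sketch only; the function names escaped_form and leading_form are hypothetical, and the behavior shown assumes an awk that follows POSIX in its treatment of ‘\’ inside brackets.

```shell
# Form 1: escape the closing bracket with a backslash.
# escaped_form and leading_form are hypothetical helper names.
escaped_form() {
    printf '%s\n' "$1" | awk '{ gsub(/[d\]]/, "_"); print }'
}

# Form 2: place the closing bracket right after the opening one.
leading_form() {
    printf '%s\n' "$1" | awk '{ gsub(/[]d]/, "_"); print }'
}

escaped_form 'd]x'    # prints: __x
leading_form 'd]x'    # prints: __x
```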
The treatment of ‘\’ in bracket expressions is compatible with other awk implementations and is also mandated by POSIX.
The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional egrep utility.
Character classes are a feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but the actual characters can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs between the United States and France.
A character class is only valid in a regexp inside the brackets of a bracket expression. Character classes consist of ‘[:’, a keyword denoting the class, and ‘:]’. Table 3.1 lists the character classes defined by the POSIX standard.
Class | Meaning |
---|---|
[:alnum:] | Alphanumeric characters |
[:alpha:] | Alphabetic characters |
[:blank:] | Space and TAB characters |
[:cntrl:] | Control characters |
[:digit:] | Numeric characters |
[:graph:] | Characters that are both printable and visible (a space is printable but not visible, whereas an ‘a’ is both) |
[:lower:] | Lowercase alphabetic characters |
[:print:] | Printable characters (characters that are not control characters) |
[:punct:] | Punctuation characters (characters that are not letters, digits, control characters, or space characters) |
[:space:] | Space characters (these are: space, TAB, newline, carriage return, formfeed and vertical tab) |
[:upper:] | Uppercase alphabetic characters |
[:xdigit:] | Characters that are hexadecimal digits |
For example, before the POSIX standard, you had to write /[A-Za-z0-9]/ to match alphanumeric characters. If your character set had other alphabetic characters in it, this would not match them. With the POSIX character classes, you can write /[[:alnum:]]/ to match the alphabetic and numeric characters in your character set.
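The following sketch uses a complemented bracket expression containing ‘[:alnum:]’ to delete everything that is not alphanumeric. The function name keep_alnum is hypothetical. The example forces the C locale so that the result does not depend on the user's environment, and it assumes an awk that supports POSIX character classes (some very old implementations do not).

```shell
# Delete every character that is NOT in the [:alnum:] class.
# keep_alnum is a hypothetical helper name; LC_ALL=C pins the locale.
keep_alnum() {
    printf '%s\n' "$1" | LC_ALL=C awk '{ gsub(/[^[:alnum:]]/, ""); print }'
}

keep_alnum 'a-b_c 9!'    # prints: abc9
```

Note the doubled brackets: the outer pair delimits the bracket expression, and the inner ‘[:alnum:]’ names the class.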
Some utilities that match regular expressions provide a nonstandard ‘[:ascii:]’ character class; awk does not. However, you can simulate such a construct using ‘[\x00-\x7F]’. This matches all values numerically between zero and 127, which is the defined range of the ASCII character set. Use a complemented character list (‘[^\x00-\x7F]’) to match any single-byte characters that are not in the ASCII range.
NOTE: Some older versions of Unix awk treat ‘[:blank:]’ like ‘[:space:]’, incorrectly matching more characters than they should. Caveat Emptor.
Two additional special sequences can appear in bracket expressions. These apply to non-ASCII character sets, which can have single symbols (called collating elements) that are represented with more than one character. They can also have several characters that are equivalent for collating, or sorting, purposes. (For example, in French, a plain “e” and a grave-accented “è” are equivalent.) These sequences are:
Collating symbols: Multicharacter collating elements enclosed between ‘[.’ and ‘.]’. For example, if ‘ch’ is a collating element, then ‘[[.ch.]]’ is a regexp that matches this collating element, whereas ‘[ch]’ is a regexp that matches either ‘c’ or ‘h’.
Equivalence classes: Locale-specific names for a list of characters that are equal. The name is enclosed between ‘[=’ and ‘=]’. For example, the name ‘e’ might be used to represent all of “e,” “ê,” “è,” and “é.” In this case, ‘[[=e=]]’ is a regexp that matches any of ‘e’, ‘ê’, ‘é’, or ‘è’.
These features are very valuable in non-English-speaking locales.
CAUTION: The library functions that gawk uses for regular expression matching currently recognize only POSIX character classes; they do not recognize collating symbols or equivalence classes.
Inside a bracket expression, an opening bracket (‘[’) that does not start a character class, collating element or equivalence class is taken literally. This is also true of ‘.’ and ‘*’.
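A short sketch of this rule: inside the brackets, ‘*’, ‘.’, and ‘[’ stand for themselves, so the bracket expression below matches those three literal characters and nothing else. The function name mark_specials is hypothetical. The ‘[’ is placed last so that it cannot be read as the start of ‘[:’, ‘[.’, or ‘[=’.

```shell
# Replace each literal '*', '.', or '[' with '_'.
# mark_specials is a hypothetical helper name.
mark_specials() {
    printf '%s\n' "$1" | awk '{ gsub(/[*.[]/, "_"); print }'
}

mark_specials 'a.b*c[d'    # prints: a_b_c_d
```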