Regexp Field Splitting (The GNU Awk User’s Guide)

4.5.2 Using Regular Expressions to Separate Fields

The previous subsection discussed the use of single characters or simple strings as the value of FS. More generally, the value of FS may be a string containing any regular expression. In this case, each match in the record for the regular expression separates fields. For example, the assignment:

FS = ", \t"

makes every area of an input line that consists of a comma followed by a space and a TAB into a field separator.
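
As a small sketch of this assignment in action, the following shell snippet feeds two sample lines to awk. The helper name split_on_comma_space_tab and the sample input are made up for illustration; they are not part of awk. Each call prints NF and the first field:

```shell
# split_on_comma_space_tab is a hypothetical helper name used only in this sketch.
split_on_comma_space_tab() {
    awk 'BEGIN { FS = ", \t" } { print NF ": " $1 }'
}

# Every separator is comma + space + TAB, so there are three fields.
printf 'a, \tb, \tc\n' | split_on_comma_space_tab    # prints "3: a"

# The first comma is followed only by a space, so it stays inside $1.
printf 'a, b, \tc\n' | split_on_comma_space_tab      # prints "2: a, b"
```

The second call shows that all three characters must appear, in order, for a separator to be recognized.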

For a less trivial example of a regular expression, try using single spaces to separate fields the way single commas are used. FS can be set to "[ ]" (left bracket, space, right bracket). This regular expression matches a single space and nothing else (see Regular Expressions).
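
The following sketch illustrates the effect of "[ ]" on input containing two adjacent spaces. The helper name show_single_space_fields and the sample line are assumptions chosen for this illustration. Because each space is a separator in its own right, two spaces in a row enclose an empty field:

```shell
# show_single_space_fields is a hypothetical helper: it prints NF,
# then every field wrapped in brackets so that empty fields are visible.
show_single_space_fields() {
    awk 'BEGIN { FS = "[ ]" }
         { printf "%d:", NF
           for (i = 1; i <= NF; i++) printf " [%s]", $i
           print "" }'
}

# Two spaces between "a" and "b", one space between "b" and "c".
printf 'a  b c\n' | show_single_space_fields    # prints "4: [a] [] [b] [c]"
```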

There is an important difference between the two cases of ‘FS = " "’ (a single space) and ‘FS = "[ \t\n]+"’ (a regular expression matching one or more spaces, TABs, or newlines). For both values of FS, fields are separated by runs (multiple adjacent occurrences) of spaces, TABs, and/or newlines. However, when the value of FS is " ", awk first strips leading and trailing whitespace from the record and then decides where the fields are. For example, the following pipeline prints ‘b’:

$ echo ' a b c d ' | awk '{ print $2 }'
-| b

However, this pipeline prints ‘a’ (note the extra spaces around each letter):

$ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" }
>                                  { print $2 }'
-| a

In this case, the first field is null, or empty.
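
Printing NF makes the same difference visible from another angle. In this sketch, the helper names count_default and count_regexp are hypothetical. With the regexp, the leading and trailing spaces each count as a separator, so an empty field appears at both ends of the record:

```shell
# Hypothetical helpers: count fields with the default FS (" ")
# and with the regexp FS discussed above.
count_default() { awk '{ print NF }'; }
count_regexp()  { awk 'BEGIN { FS = "[ \t\n]+" } { print NF }'; }

# Default FS strips leading and trailing whitespace: a, b, c, d.
printf ' a  b  c  d \n' | count_default    # prints 4

# Regexp FS keeps the empty first and last fields: "", a, b, c, d, "".
printf ' a  b  c  d \n' | count_regexp     # prints 6
```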

The stripping of leading and trailing whitespace also comes into play whenever $0 is recomputed. For instance, study this pipeline:

$ echo '   a b c d' | awk '{ print; $2 = $2; print }'
-|    a b c d
-| a b c d

The first print statement prints the record as it was read, with leading whitespace intact. The assignment to $2 rebuilds $0 by concatenating $1 through $NF together, separated by the value of OFS (which is a space by default). Because the leading whitespace was ignored when finding $1, it is not part of the new $0. Finally, the last print statement prints the new $0.
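
Changing OFS makes the rebuilding step easier to see. The following sketch uses a hypothetical helper name, rebuild_with_dash, and sets OFS to a dash purely for illustration:

```shell
# rebuild_with_dash is a hypothetical helper. Assigning $2 to itself
# forces awk to rebuild $0 from $1 through $NF, joined by OFS.
rebuild_with_dash() {
    awk 'BEGIN { OFS = "-" } { $2 = $2; print }'
}

# The leading spaces are not part of any field, so they disappear.
printf '   a b c d\n' | rebuild_with_dash    # prints "a-b-c-d"
```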

There is an additional subtlety to be aware of when using regular expressions for field splitting. It is not well specified in the POSIX standard, or anywhere else, what ‘^’ means when splitting fields. Does the ‘^’ match only at the beginning of the entire record? Or is each field separator a new string? It turns out that different awk versions answer this question differently, and you should not rely on any specific behavior in your programs. (d.c.)

As a point of information, BWK awk allows ‘^’ to match only at the beginning of the record. gawk also works this way. For example:

$ echo 'xxAA  xxBxx  C' |
> gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++)
>                             printf "-->%s<--\n", $i }'
-| --><--
-| -->AA<--
-| -->xxBxx<--
-| -->C<--

Finally, field splitting with regular expressions works differently from regexp matching with the sub(), gsub(), and gensub() functions (see String-Manipulation Functions). Those functions allow a regexp to match the empty string; field splitting does not. Thus, for example, ‘FS = "()"’ does not split fields between characters.
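
For comparison, here is a sketch of the substitution side of that contrast. The helper name mark_empty_matches and the regexp /x*/ are chosen only for illustration. Because ‘x*’ can match the empty string, gsub() finds a match at every position in the record:

```shell
# mark_empty_matches is a hypothetical helper. The regexp x* matches the
# empty string before each character and at the end of the record.
mark_empty_matches() {
    awk '{ gsub(/x*/, "-"); print }'
}

printf 'abc\n' | mark_empty_matches    # prints "-a-b-c-"
```

Field splitting has no counterpart to this behavior, which is why a regexp that matches only the empty string cannot be used to separate individual characters.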