Record Splitting with Standard awk
Records are separated by a character called the record separator.
By default, the record separator is the newline character.
This is why records are, by default, single lines.
To use a different character for the record separator,
simply assign that character to the predefined variable RS.
Like any other variable, the value of RS can be changed in the awk
program with the assignment operator, ‘=’
(see section Assignment Expressions).
The new record-separator character should be enclosed in quotation marks,
which indicate a string constant. Often, the right time to do this is
at the beginning of execution, before any input is processed,
so that the very first record is read with the proper separator.
To do this, use the special BEGIN pattern
(see section The BEGIN and END Special Patterns).
For example:
awk 'BEGIN { RS = "u" } { print $0 }' mail-list
changes the value of RS to ‘u’, before reading any input.
The new value is a string whose first character is the letter “u”; as a result, records
are separated by the letter “u”. Then the input file is read, and the second
rule in the awk program (the action with no pattern) prints each
record. Because each print statement adds a newline at the end of
its output, this awk program copies the input
with each ‘u’ changed to a newline. Here are the results of running
the program on mail-list:
$ awk 'BEGIN { RS = "u" }
>      { print $0 }' mail-list
-| Amelia 555-5553 amelia.zodiac
-| sq
-| e@gmail.com F
-| Anthony 555-3412 anthony.assert
-| ro@hotmail.com A
-| Becky 555-7685 becky.algebrar
-| m@gmail.com A
-| Bill 555-1675 bill.drowning@hotmail.com A
-| Broderick 555-0542 broderick.aliq
-| otiens@yahoo.com R
-| Camilla 555-2912 camilla.inf
-| sar
-| m@skynet.be R
-| Fabi
-| s 555-1234 fabi
-| s.
-| ndevicesim
-| s@
-| cb.ed
-| F
-| J
-| lie 555-6699 j
-| lie.perscr
-| tabor@skeeve.com F
-| Martin 555-6480 martin.codicib
-| s@hotmail.com A
-| Sam
-| el 555-3430 sam
-| el.lanceolis@sh
-| .ed
-| A
-| Jean-Pa
-| l 555-2127 jeanpa
-| l.campanor
-| m@ny
-| .ed
-| R
-|
Note that the entry for the name ‘Bill’ is not split. In the original data file (see section Data files for the Examples), the line looks like this:
Bill 555-1675 bill.drowning@hotmail.com A
It contains no ‘u’, so there is no reason to split the record,
unlike the others, which each have one or more occurrences of ‘u’.
In fact, this record is treated as part of the previous record;
the newline separating them in the output
is the original newline in the data file, not the one added by
awk when it printed the record!
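The effect is easier to see when each record is numbered. The following is a minimal sketch, not part of the original example: the three-line input is a hypothetical stand-in for mail-list, and the helper names sample and split_on_u are made up for illustration.

```shell
# Hypothetical three-line stand-in for mail-list; only the first and
# third lines contain the letter "u".
sample() {
    printf 'Samuel 555-3430 A\nBill 555-1675 A\nJulie 555-6699 F\n'
}

# Prefix each record with its record number (NR) so that the record
# boundaries are visible in the output.
split_on_u() {
    awk 'BEGIN { RS = "u" } { print NR ": " $0 }'
}

sample | split_on_u
```

Record 2 spans three output lines here. The line for ‘Bill’ contains no ‘u’, so it sits in the middle of that record, and the newlines around it come from the input data rather than from print.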
Another way to change the record separator is on the command line, using the variable-assignment feature (see section Other Command-Line Arguments):
awk '{ print $0 }' RS="u" mail-list
This sets RS to ‘u’ before processing mail-list.
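As a rough check that the two forms behave alike, the sketch below runs both against the same temporary file and keeps the results for comparison. The file contents, the ‘;’ separator, and the variable names tmp, via_begin, and via_arg are assumptions chosen for illustration.

```shell
# Write a small hypothetical input file that uses ";" between records.
tmp=$(mktemp)
printf 'red;green;blue' > "$tmp"

# Form 1: assign RS in a BEGIN rule.
via_begin=$(awk 'BEGIN { RS = ";" } { print NR ": " $0 }' "$tmp")

# Form 2: assign RS as a command-line argument placed before the file name.
via_arg=$(awk '{ print NR ": " $0 }' RS=";" "$tmp")

printf '%s\n' "$via_arg"
rm -f "$tmp"
```

The assignment is placed before the file name because awk handles command-line assignments in order as it walks through its arguments.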
Using an alphabetic character such as ‘u’ for the record separator is highly likely to produce strange results. Using an unusual character such as ‘/’ is more likely to produce correct behavior in the majority of cases, but there are no guarantees. The moral is: Know Your Data.
gawk allows RS to be a full regular expression
(discussed shortly; see section Record Splitting with gawk). Even so, using
a regular expression metacharacter, such as ‘.’, as the single
character in the value of RS has no special effect: it is
treated literally. This is required for backwards compatibility with
both Unix awk and POSIX.
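A short sketch of the literal treatment, using a made-up input string and a hypothetical helper named split_on_dot:

```shell
# RS is the single character "."; it is taken literally, so records
# end only where the input contains an actual period.
split_on_dot() {
    awk 'BEGIN { RS = "." } { print NR ": " $0 }'
}

printf 'a.b.c' | split_on_dot
```

If ‘.’ acted as a regular expression here, every input character would count as a separator. Instead the input yields exactly three records: ‘a’, ‘b’, and ‘c’.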
When using regular characters as the record separator,
there is one unusual case that occurs when gawk is
being fully POSIX-compliant (see section Command-Line Options).
Then, the following (extreme) pipeline prints a surprising ‘1’:
$ echo | gawk --posix 'BEGIN { RS = "a" } ; { print NF }'
-| 1
There is one field, consisting of a newline. The value of the built-in
variable NF is the number of fields in the current record.
(In the normal case, gawk treats the newline as whitespace,
printing ‘0’ as the result. Most other versions of awk
also act this way.)
Reaching the end of an input file terminates the current input record,
even if the last character in the file is not the character in RS. (d.c.)
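One way to observe this is to count records for the same data with and without a trailing separator. The input strings and the helper name count_records below are illustrative assumptions.

```shell
# Print the total number of records read, using ";" as the separator.
count_records() {
    awk 'BEGIN { RS = ";" } END { print NR }'
}

printf 'one;two;three;' | count_records   # input ends with the separator
printf 'one;two;three'  | count_records   # input ends without it
```

Both pipelines report three records: in the second, the end of the input closes the final record ‘three’ even though no ‘;’ follows it.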
The empty string "" (a string without any characters)
has a special meaning as the value of RS. It means that records are separated
by one or more blank lines and nothing else.
See section Multiple-Line Records for more details.
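A brief sketch of this mode, assuming a hypothetical two-entry address list in which the entries are separated by two blank lines (the helper name paragraphs is also made up):

```shell
# With RS set to the empty string, each run of blank lines ends a record.
# Print the record number, the first field, and the field count.
paragraphs() {
    awk 'BEGIN { RS = "" } { print NR ": " $1 " (" NF " fields)" }'
}

printf 'Jane Doe\n123 Main St\n\n\nJohn Roe\n456 Oak Ave\n' | paragraphs
```

The two consecutive blank lines count as a single separator, so the input yields two records, each holding the two lines of one entry.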
If you change the value of RS in the middle of an awk run,
the new value is used to delimit subsequent records, but the record
currently being processed, as well as records already processed, are not
affected.
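The sketch below illustrates the timing with made-up input: a newline-terminated header line followed by ‘;’-separated data. The helper name switch_separator is hypothetical.

```shell
# The first record is read with the default separator (newline).
# While processing it, the program switches RS to ";", which applies
# to every record read after that point.
switch_separator() {
    awk 'NR == 1 { RS = ";" } { print NR ": " $0 }'
}

printf 'header\na;b;c' | switch_separator
```

The header is still printed as a whole line, because it had already been read when RS changed. The remaining input is then split at each ‘;’ into three further records.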
After the end of the record has been determined, gawk
sets the variable RT to the text in the input that matched RS.
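Because RT is specific to gawk, the following sketch runs only when a gawk binary is found on the search path. The input string and the helper name show_rt are assumptions for illustration.

```shell
# Print each record followed by the separator text that ended it.
show_rt() {
    gawk 'BEGIN { RS = ";" } { print $0 " [RT=" RT "]" }'
}

# Skip the demonstration when gawk is not installed.
if command -v gawk >/dev/null 2>&1; then
    printf 'a;b' | show_rt
fi
```

For the first record, RT holds ‘;’. The second record is ended by the end of the input rather than by a match for RS, so RT is empty there.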