The system tr utility transliterates characters. For example, it is
often used to map uppercase letters into lowercase for further processing:
generate data | tr 'A-Z' 'a-z' | process data …
tr requires two lists of characters (but see the note on older systems at the end of this section). When processing the input, the
first character in the first list is replaced with the first character
in the second list, the second character in the first list is replaced
with the second character in the second list, and so on. If there are
more characters in the “from” list than in the “to” list, the last
character of the “to” list is used for the remaining characters in the
“from” list.
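The pairing rule just described can be sketched in a few lines of awk. This is only an illustration of the rule as stated here, not the code of any real tr implementation; show_map is a hypothetical helper name.

```shell
# show_map FROM TO: print the character pairs implied by the rule above.
# (show_map is a hypothetical helper, not part of tr or gawk.)
show_map() {
    awk -v from="$1" -v to="$2" 'BEGIN {
        lt = length(to)
        for (i = 1; i <= length(from); i++) {
            # Past the end of "to", reuse its last character.
            j = (i <= lt) ? i : lt
            print substr(from, i, 1) " -> " substr(to, j, 1)
        }
    }'
}

show_map abcd xy
```

With a four-character “from” list and a two-character “to” list, this prints the pairs a -> x, b -> y, c -> y, and d -> y, one per line.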
Once upon a time, a user proposed adding a transliteration function
to gawk. The following program was written to
prove that character transliteration could be done with a user-level
function. This program is not as complete as the system tr utility,
but it does most of the job.
The translate program was written long before gawk
acquired the ability to split each character in a string into separate
array elements. Thus, it relies on the length() and substr()
built-in functions (see section String-Manipulation Functions). There are two functions. The first, stranslate(),
takes three arguments:
from
    A list of characters from which to translate

to
    A list of characters to which to translate

target
    The string on which to do the translation
Associative arrays make the translation part fairly easy. t_ar holds
the “to” characters, indexed by the “from” characters. Then a simple
loop goes through target, one character at a time. Each character
that appears as an index in t_ar is replaced with the corresponding
“to” character; any other character is copied unchanged.
The translate() function calls stranslate(), using $0
as the target. The main program sets two global variables, FROM and
TO, from the command line, and then changes ARGV so that
awk reads from the standard input.
Finally, the processing rule simply calls translate() for each record:
# translate.awk --- do tr-like stuff
# Bugs: does not handle things like tr A-Z a-z; it has
# to be spelled out. However, if `to' is shorter than `from',
# the last character in `to' is used for the rest of `from'.

function stranslate(from, to, target,     lf, lt, ltarget, t_ar, i, c,
                                                               result)
{
    lf = length(from)
    lt = length(to)
    ltarget = length(target)
    for (i = 1; i <= lt; i++)
        t_ar[substr(from, i, 1)] = substr(to, i, 1)
    if (lt < lf)
        for (; i <= lf; i++)
            t_ar[substr(from, i, 1)] = substr(to, lt, 1)
    for (i = 1; i <= ltarget; i++) {
        c = substr(target, i, 1)
        if (c in t_ar)
            c = t_ar[c]
        result = result c
    }
    return result
}

function translate(from, to)
{
    return $0 = stranslate(from, to, $0)
}

# main program
BEGIN {
    if (ARGC < 3) {
        print "usage: translate from to" > "/dev/stderr"
        exit
    }
    FROM = ARGV[1]
    TO = ARGV[2]
    ARGC = 2
    ARGV[1] = "-"
}

{
    translate(FROM, TO)
    print
}
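The ARGV handling in the BEGIN rule above can be tried on its own. The following is a minimal sketch rather than part of translate.awk; argv_demo is a hypothetical wrapper name, and the rule body simply echoes the saved values next to each input record.

```shell
# argv_demo FROM TO: show that the two arguments are consumed as data
# and that awk then reads standard input. (Hypothetical helper name.)
argv_demo() {
    awk 'BEGIN {
             FROM = ARGV[1]      # save the two lists
             TO = ARGV[2]
             ARGC = 2            # only ARGV[0] and ARGV[1] remain
             ARGV[1] = "-"       # "-" names standard input
         }
         { print FROM, TO, $0 }' "$@"
}

printf 'hello\n' | argv_demo abc xyz
```

Without the two assignments to ARGC and ARGV[1], awk would try to open files named abc and xyz instead of reading the piped input.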
It is possible to do character transliteration in a user-level
function, but it is not necessarily efficient, and we (the gawk
developers) started to consider adding a built-in function. However,
shortly after writing this program, we learned that Brian Kernighan
had added the toupper() and tolower() functions to his
awk (see section String-Manipulation Functions). These functions handle the
vast majority of the cases where character transliteration is necessary,
and so we chose to simply add those functions to gawk as well
and then leave well enough alone.
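For the common case mentioned at the start of this section, mapping uppercase to lowercase, the built-in function is enough. A small sketch, where lower is a hypothetical wrapper name:

```shell
# lower: lowercase every input record using the built-in tolower().
# (lower is a hypothetical helper name.)
lower() {
    awk '{ print tolower($0) }'
}

echo 'Hello, World' | lower
```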
An obvious improvement to this program would be to set up the
t_ar array only once, in a BEGIN rule. However, this
assumes that the “from” and “to” lists
will never change throughout the lifetime of the program.
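One way this first improvement might look is sketched below. It is an assumption-laden variant, not the manual's code: translate_once and the global array T_AR are hypothetical names, and the lists are passed with -v for brevity, which means awk interprets backslash escapes in them.

```shell
# translate_once FROM TO: like translate.awk, but the lookup table is
# built a single time in BEGIN. (Hypothetical names throughout.)
translate_once() {
    awk -v from="$1" -v to="$2" '
    BEGIN {
        # Build the global table T_AR before any input is read.
        lf = length(from)
        lt = length(to)
        for (i = 1; i <= lf; i++)
            T_AR[substr(from, i, 1)] = substr(to, (i <= lt) ? i : lt, 1)
    }
    {
        # Per record, only the lookup loop remains.
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c in T_AR)
                c = T_AR[c]
            out = out c
        }
        print out
    }'
}

echo 'hello world' | translate_once lo 01
```

Here every l becomes 0 and every o becomes 1, so the command prints he001 w1r0d.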
Another obvious improvement is to enable the use of ranges,
such as ‘a-z’, as allowed by the tr utility.
Look at the code for cut.awk (see section Cutting Out Fields and Columns)
for inspiration.
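A possible starting point for range support is sketched here. It is not taken from cut.awk; expand_ranges, expand(), CHR, and ORD are hypothetical names, and the sketch assumes printable ASCII only (codes 32 through 126). Anything that does not form a valid ascending range is copied through literally.

```shell
# expand_ranges LIST: expand ranges such as a-e into abcde.
# (Hypothetical helper; printable ASCII only.)
expand_ranges() {
    awk -v list="$1" '
    function expand(s,    i, n, a, b, k, out)
    {
        n = length(s)
        for (i = 1; i <= n; i++) {
            a = substr(s, i, 1)
            b = substr(s, i + 2, 1)     # empty when past the end of s
            if (substr(s, i + 1, 1) == "-" &&
                (a in ORD) && (b in ORD) && ORD[a] <= ORD[b]) {
                for (k = ORD[a]; k <= ORD[b]; k++)
                    out = out CHR[k]
                i += 2                  # skip the "-" and the end character
            } else
                out = out a
        }
        return out
    }
    BEGIN {
        # Tables mapping codes to characters and back.
        for (k = 32; k <= 126; k++) {
            CHR[k] = sprintf("%c", k)
            ORD[CHR[k]] = k
        }
        print expand(list)
    }'
}

expand_ranges 'a-e'        # prints abcde
expand_ranges 'A-Cx0-3'    # prints ABCx0123
```

The expanded strings could then be handed to stranslate() as its from and to arguments.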
Note: On some older systems, including Solaris, the system version of
tr may require that the lists be written as range expressions enclosed
in square brackets (‘[a-z]’) and quoted, to prevent the shell from
attempting a file name expansion. This is not a feature.