45.30 Grabbing Parts of a String

How can you parse (split, search) a string of text to find the last word, the second column, and so on? There are a lot of different ways. Pick the one that works best for you - or invent another one! (UNIX has lots of ways to work with strings of text.)

45.30.1 Matching with expr

The expr command (45.28) can grab part of a string with a regular expression. The example below is from a shell script whose last command-line argument is a filename. The two commands below use expr to grab the last argument and all arguments except the last one. The "$*" gives expr a list of all command-line arguments in a single word. (Using "$@" (44.15) here wouldn't work because it gives individually quoted arguments. expr needs all arguments in one word.)

last=`expr "$*" : '.* \(.*\)'`    # LAST ARGUMENT
first=`expr "$*" : '\(.*\) .*'`    # ALL BUT LAST ARGUMENT

Let's look at the regular expression that gets the last word. The leading part of the expression, .* , matches as many characters as it can, followed by a space. This includes all words up to and including the last space. After that, the end of the expression, \(.*\), matches the last word.

The regular expression that grabs the first words is the same as the previous one - but I've moved the \( \) pair. Now it grabs all words up to but not including the last space. The end of the regular expression, .*, matches the last space and last word - and expr ignores them. So the final .* really isn't needed here (though the space is). I've included that final .* because it follows from the first example.
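To watch both expressions work, here is a small sketch you can paste into a Bourne-type shell. The argument list is made up for illustration, and set -- is used only to fake the command-line arguments a real script would receive. The first_short variable is my own addition; it checks the claim that the trailing .* is optional.

```shell
# Hypothetical arguments, set here only so the example is self-contained.
set -- one two three report.txt

last=`expr "$*" : '.* \(.*\)'`        # last argument
first=`expr "$*" : '\(.*\) .*'`       # all but the last argument
first_short=`expr "$*" : '\(.*\) '`   # same pattern without the final .*

echo "last:        $last"
echo "first:       $first"
echo "first_short: $first_short"
```

Note that both patterns need at least one space to match. If the script gets only one argument, expr prints an empty string, so a careful script would test $# first.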

expr is great when you want to split a string into just two parts. The .* also makes expr good for skipping a variable number of words when you don't know how many words a string will have. But expr is lousy for getting, say, the fourth word in a string. And it's almost useless for handling more than one line of text at a time.

45.30.2 Using echo with awk, colrm, or cut

awk can split lines into words. But awk has a lot of overhead and can take some time to execute, especially on a busy system. The cut (35.14) and colrm (35.15) commands start more quickly than awk but they can't do as much.

All of those utilities are designed to handle multiple lines of text. You can tell awk to handle a single line with its pattern-matching operators and its NR variable. You can also run those utilities with a single line of text, fed to the standard input through a pipe from echo (8.6). For example, to get the third field from a colon-separated string:

string="this:is:just:a:dummy:string"
field3_awk=`echo "$string" | awk -F: '{print $3}'`
field3_cut=`echo "$string" | cut -d: -f3`

Let's combine two echo commands. The inner one sends text to awk, cut, or colrm through a pipe; the utility ignores all the text in columns 1-24, then prints column 25 through the end of the text stored in the text variable. The outer echo prints "The answer is" followed by that answer. Notice that the inner double quotes are escaped with backslashes to keep the Bourne shell from interpreting them before the inner echo runs:

echo "The answer is `echo \"$text\" | awk '{print substr($0,25)}'`"
echo "The answer is `echo \"$text\" | cut -c25-`"
echo "The answer is `echo \"$text\" | colrm 1 24`"
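If the escaped quotes bother you, one alternative is to grab the piece into a variable first and then print it. The sketch below assumes a made-up value for text, built with printf so that the part we want really does start in column 25. The colrm line is guarded because colrm isn't installed everywhere.

```shell
# Hypothetical text: a label padded to 24 columns, then the value we want.
text=`printf '%-24s%s' 'Total files processed:' '1234 of 5000'`

from_awk=`echo "$text" | awk '{print substr($0,25)}'`
from_cut=`echo "$text" | cut -c25-`

# colrm is optional on many systems, so only run it if it exists.
if command -v colrm >/dev/null 2>&1; then
    from_colrm=`echo "$text" | colrm 1 24`
fi

echo "The answer is $from_cut"
```

All three utilities should give the same answer here; which one you pick is mostly a matter of what's installed and what else you need the utility to do.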

45.30.3 Using set

The Bourne shell set (44.19) command can be used to parse a single-line string and store it in the command-line parameters (44.15) "$@", $*, $1, $2, and so on. Then you can also loop through the words with a for loop (44.16) and use everything else the shell has for dealing with command-line parameters. Also, you can set the IFS variable (35.21) to control how the shell splits the string.
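Here's a minimal sketch of the idea, assuming a made-up line in /etc/passwd format. The variable names (line, old_ifs, count, login, name, shell) are mine. The -- after set is a POSIX-shell safeguard in case the string starts with a dash; very old Bourne shells may not accept it.

```shell
# Hypothetical input line, colon-separated like /etc/passwd.
line="jdoe:x:1001:100:Jane Doe:/home/jdoe:/bin/sh"

old_ifs=$IFS      # remember the usual separators
IFS=:             # split on colons only
set -- $line      # unquoted on purpose, so the shell splits the string
IFS=$old_ifs      # put the separators back right away

count=$#          # number of fields found
login=$1
name=$5           # the space in "Jane Doe" survives: space isn't in IFS now
shell=$7

echo "$count fields; login=$login name=$name shell=$shell"

# The fields are ordinary positional parameters, so a for loop works too.
for field in "$@"
do
    echo "[$field]"
done
```

Remember that set overwrites the script's real command-line arguments, so save anything you still need from them before you do this. Also, because $line is unquoted, the shell will expand wildcards such as * in the string unless you turn that off (set -f in POSIX shells).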

45.30.4 Using sed

The UNIX sed (34.24) utility is good at parsing input that you may or may not be able to split into words otherwise, at finding a single line of text in a group and outputting it, and many other things. In this example, I want to get the percentage-used of the filesystem mounted on /home. That information is buried in the output of the df (24.9) command. On my system, df output looks like:

% df
Filesystem            kbytes    used   avail capacity  Mounted on
   ...
/dev/sd3c            1294854  914230  251139    78%    /work
/dev/sd4c             597759  534123    3861    99%    /home
   ...

I want the number 99 from the line ending with /home. The sed address / \/home$/ will find that line (including a space before the /home makes sure the address doesn't match a line ending with /something/home). The -n option keeps sed from printing any lines except the line we ask it to print (with its p command). I know that the "capacity" is the only word on the line that ends with a percent sign (%). A space after the first .* makes sure that .* doesn't "eat" the first digit of the number that we want to match by [0-9]. The sed escaped-parenthesis operators (34.10) grab that number. Here goes:

usage=`df | sed -n '/ \/home$/s/.* \([0-9][0-9]*\)%.*/\1/p'`
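Your df output probably doesn't look exactly like mine, so here is a sketch that runs the same sed command against a canned copy of the listing above. The sample_df function is a hypothetical stand-in for df, there only so the result doesn't depend on your system.

```shell
# sample_df is a hypothetical stand-in for df; it prints the sample
# listing shown in the text.
sample_df() {
cat <<'EOF'
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/sd3c            1294854  914230  251139    78%    /work
/dev/sd4c             597759  534123    3861    99%    /home
EOF
}

# Same sed command as in the text, reading the canned listing.
usage=`sample_df | sed -n '/ \/home$/s/.* \([0-9][0-9]*\)%.*/\1/p'`
echo "usage=$usage"
```

Once that works, swap sample_df for the real df and adjust the address and pattern if your df uses different columns.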

Combining sed with eval (8.10) lets you set several shell variables at once from parts of the same line. Here's a command line that sets two shell variables from the df output:

eval `df |
sed -n '/ \/home$/s/^[^ ]*  *\([0-9]*\)  *\([0-9]*\).*/kb=\1 u=\2/p'`

The left-hand side of that substitution command has a regular expression that uses sed's escaped parenthesis operators. They grab the "kbytes" and "used" columns from the df output. The right-hand side outputs the two df values with Bourne shell variable-assignment commands to set the kb and u variables. After sed finishes, the resulting command line looks like this:

eval kb=597759 u=534123

Now $kb will give you 597759 and $u contains 534123.
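The same trick can be tried against canned input. In this sketch, sample_df is again a hypothetical stand-in for df, and the unused variable is my own addition to show the two values being put to work.

```shell
# sample_df is a hypothetical stand-in for df; it prints the sample
# listing shown earlier in this article.
sample_df() {
cat <<'EOF'
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/sd3c            1294854  914230  251139    78%    /work
/dev/sd4c             597759  534123    3861    99%    /home
EOF
}

# sed writes "kb=... u=..." and eval runs that text as shell assignments.
eval `sample_df |
sed -n '/ \/home$/s/^[^ ]*  *\([0-9]*\)  *\([0-9]*\).*/kb=\1 u=\2/p'`

# Use the two values: kilobytes not counted as used.
unused=`expr $kb - $u`
echo "kb=$kb u=$u unused=$unused"
```

Because eval executes whatever sed prints, it helps that the parenthesized parts of the pattern match only digits: nothing but numbers can end up in the assignments. If no line matches, sed prints nothing and kb and u stay unset, so a careful script would check for that.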

- JP

