UNIX Power Tools

Chapter 29: Spell Checking, Word Counting, and Textual Analysis

29.7 Count How Many Times Each Word Is Used

The wordfreq script counts the number of occurrences of each word in its input. If you give it files, it reads from them; otherwise it reads standard input. The -i option folds uppercase into lowercase, so "The" and "the" count as the same word.

Here's this book's Preface run through wordfreq:

% wordfreq ch00
 141 the
  98 to
  84 and
  84 of
  71 a
  55 in
  44 that
  38 book
  32 we
  ...

The script was taken from a long-ago Usenet (1.33) posting by Carl Brandauer. Here is Carl's original script (with a few small edits):


cat $* |                            # tr reads only standard input, so cat feeds it
tr "[A-Z]" "[a-z]" |                # Convert all uppercase to lowercase
tr -cs "a-z'" "\012" |              # Replace all characters not a-z or '
                                    # with a newline, i.e., one word per line
sort |                              # uniq expects sorted input
uniq -c |                           # Count number of times each word appears
sort +0nr +1d |                     # Sort first from most to least frequent,
                                    # then alphabetically
pr -w80 -4 -h "Concordance for $*"  # Print in four columns
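
The sort +0nr +1d line uses the old "+position" key syntax, which many current versions of sort may reject. Here is a rough sketch of the same counting pipeline with POSIX -k keys instead. The function name wordfreq_modern is made up for this example, and the pr step is left out; treat it as one possible translation, not as Carl's script or the disc version:

```shell
# wordfreq_modern: hypothetical name, not the script from the disc.
# Assumes a sort that accepts POSIX -k key definitions.
wordfreq_modern() {
    cat "$@" |                # read the named files, or standard input if none
    tr 'A-Z' 'a-z' |          # fold uppercase into lowercase
    tr -cs "a-z'" '\n' |      # everything but a-z and ' becomes a newline
    sort |                    # uniq expects sorted input
    uniq -c |                 # prefix each distinct word with its count
    sort -k1,1nr -k2d         # by count, highest first; then in dictionary order
}

# Example use (ch00 is the file name from the sample run above):
#   wordfreq_modern ch00 | pr -w80 -4
```

Here -k1,1nr stands in for +0nr and -k2d for +1d. The width of the count column that uniq -c prints varies from system to system, so don't depend on it.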

The version on the disc is somewhat different. It adjusts the tr commands for the script's -i option. The disc version also doesn't use pr to make output in four columns, though you can add that to your copy of the script - or just pipe the wordfreq output through pr on the command line when you need it.
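
To give a feel for what "adjusting the tr commands" for -i might look like, here is a minimal sketch. It is not the disc version: the function name wordfreq_sketch, the fold_case variable, and the way the option is parsed are all assumptions made for this example.

```shell
# wordfreq_sketch: hypothetical illustration of an -i option.
# The body runs in a subshell (note the parentheses) so that
# fold_case and the shift do not leak into the calling shell.
wordfreq_sketch() (
    fold_case=no
    if [ "$1" = "-i" ]; then
        fold_case=yes
        shift                          # drop -i; the rest are file names
    fi
    cat "$@" |
    if [ "$fold_case" = yes ]; then
        tr 'A-Z' 'a-z'                 # -i given: fold uppercase into lowercase
    else
        cat                            # no -i: leave case alone
    fi |
    tr -cs "A-Za-z'" '\n' |            # keep both cases; one word per line
    sort |
    uniq -c |
    sort -k1,1nr -k2
)
```

Without -i, "The" and "the" are counted separately; with -i they are counted together. Piping the result through pr -4 gives the four-column layout.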

The second tr command above (with the -cs options) is for the Berkeley version of tr. For System V tr, the command should be:

tr -cs "[a-z]'" "[\012*]"

If you aren't sure which version of tr you have, see article 35.11. You could also use deroff -w (29.10) instead of tr to break the text into one word per line.

One of the beauties of a simple script like this is that you can tweak it if you don't like the way it counts. For example, if you want hyphenated words like copy-editor to count as one, add a hyphen to the tr -cs expression: "[a-z]'-" (System V) or "-a-z'" (Berkeley).
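
Here is a small sketch of the hyphen tweak. The helper name split_words is made up for this example. It puts the hyphen at the end of the set rather than the beginning, on the assumption that your tr parses its arguments the way most current versions do: a set that starts with a hyphen may be mistaken for an option.

```shell
# split_words: hypothetical helper that keeps hyphenated words together.
split_words() {
    tr 'A-Z' 'a-z' |          # fold uppercase into lowercase
    tr -cs "a-z'-" '\n'       # a trailing hyphen in the set is a literal hyphen
}

# "copy-editor" stays one word; "copy editor" is still two.
printf 'The copy-editor called the copy editor\n' | split_words | sort | uniq -c
```

One side effect to watch for: a hyphen standing alone between spaces is now counted as a "word" of its own.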

- JP, TOR

