Some of my directories - my bin (4.2), for instance - have some text files (like shell scripts and documentation) as well as non-text files (executable binary files, compressed files, archives, etc.). If I'm trying to find a certain file - with grep (27.1) or a pager (25.3, 25.4) - the non-text files can print garbage on my screen. I want some way to say "only look at the files that have text in them."
The findtext shell script does that. It runs file (25.8) to guess what's in each file. It only prints filenames of text files.
So, for example, instead of typing:
% egrep something *
I type:
% egrep something `findtext *`
Here's the script, then some explanation of how to set it up on your system:
#!/bin/sh
# PIPE OUTPUT OF file THROUGH sed TO PRINT FILENAMES FROM LINES
# WE LIKE.  NOTE: DIFFERENT VERSIONS OF file RETURN DIFFERENT
# MESSAGES.  CHECK YOUR SYSTEM WITH strings /usr/bin/file OR
# cat /etc/magic AND ADAPT THIS.
/usr/bin/file "$@" |
sed -n '
/MMDF mailbox/b print
/Interleaf ASCII document/b print
/PostScript document/b print
/Frame Maker MIF file/b print
/c program text/b print
/fortran program text/b print
/assembler program text/b print
/shell script/b print
/c-shell script/b print
/shell commands/b print
/c-shell commands/b print
/English text/b print
/ascii text/b print
/\[nt\]roff, tbl, or eqn input text/b print
/executable .* script/b print
b
:print
s/:[TAB].*//p'
The script is simple: It runs file on the command-line arguments. The output of file looks like this:
COPY2PC:        directory
Ex24348:        empty
FROM_consult.tar.Z: compressed data block compressed 16 bits
GET_THIS:       ascii text
hmo:            English text
msg:            English text
1991.ok:        [nt]roff, tbl, or eqn input text
The output is piped to a sed (34.24) script that selects the lines that seem to be from text files - after the print label, the script strips off everything after the filename (starting at the colon) and prints the filename.
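To see the branch-and-print logic in isolation, here is a minimal sketch that feeds canned file output to a cut-down copy of the sed script. The function name filter_text_names, the three-pattern list, and the sample lines are assumptions made for illustration; the tab is built with printf so it survives copying.

```shell
#!/bin/sh
# Sketch only: filter_text_names is a hypothetical, shortened version of
# the sed filter in findtext, used here on canned input.
tab=$(printf '\t')    # a literal TAB, standing in for [TAB] in the script

filter_text_names() {
    # Lines matching a "text" description branch to :print; all other
    # lines hit the bare "b", which skips to the end without printing.
    sed -n "
/ascii text/b print
/English text/b print
/\[nt\]roff, tbl, or eqn input text/b print
b
:print
s/:${tab}.*//p"
}

# Sample input in the "name:<TAB>description" shape shown above.
printf 'COPY2PC:\tdirectory\nGET_THIS:\tascii text\nhmo:\tEnglish text\n' |
    filter_text_names
```

Only GET_THIS and hmo should come out; the directory line never reaches the print label. If your file separates the name and description with spaces instead of a tab, the final substitution would need adjusting.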
Different versions of file produce different output. Some versions also read an /etc/magic file. To find the kinds of names your file calls text files, use commands like:
% strings /usr/bin/file > possible
% cat /etc/magic >> possible
% vi possible
The possible file will have a list of descriptions that strings found in the file binary; some of them are for text files. If your system has an /etc/magic file, it will have lines like these:
0       long    0x1010101       MMDF mailbox
0       string  <!OPS           Interleaf ASCII document
0       string  %!              PostScript document
0       string  <MIFFile        Frame Maker MIF file
Save the descriptions of text-type files from the right-hand column.
Then, turn each line of your edited possible file into a sed command:
/description/b print
Watch for special characters in the file descriptions. I had to handle two special cases in the last two lines of the script above:
I had to change the string executable %s script from our file command to /executable .* script/b print in the sed script. That's because our file command replaces %s with a name like /bin/ksh.

Characters that sed will treat as regular expression metacharacters, such as the brackets in [nt]roff, need to be escaped with backslashes. I used \[nt\]roff in the script.
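The conversion, including both special cases, can be roughed out with sed itself. This is only a sketch: descriptions_to_sed is a hypothetical helper, and it assumes the input has one description per line and that %s is the only printf-style placeholder your file uses.

```shell
#!/bin/sh
# Sketch only: descriptions_to_sed is a hypothetical helper, not part of
# findtext. It reads one file description per line and writes one sed
# command per line.
descriptions_to_sed() {
    # 1. Backslash-escape characters that are special in a sed pattern.
    # 2. Turn the %s placeholder into .* (any interpreter name).
    # 3. Wrap the result as /description/b print.
    sed -e 's|[][/\.*^$]|\\&|g' \
        -e 's|%s|.*|g' \
        -e 's|.*|/&/b print|'
}

printf '%s\n' 'ascii text' 'executable %s script' \
    '[nt]roff, tbl, or eqn input text' |
    descriptions_to_sed
```

Check the generated lines by eye before pasting them into the script, since a description with other odd characters may still need hand editing.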
If you have perl (37.1), you can make a simpler version of this script, since perl has a built-in test for whether or not a file is a text file. Perl picks a "text file" by checking the first block or so for strange control codes or metacharacters. If there are too many (more than 10% in the Perl version described here; current Perl documentation puts the cutoff at roughly a third), it's not a text file. You can't tune the Perl script to, for example, skip a certain kind of file by type. But the Perl version is simpler! It looks like this:
% perl -le '-T && print while $_ = shift' *
If you want to put that into an alias (10.2), the C shell's quoting problems (47.2, 8.15) make it tough to do. Thanks to makealias (10.8), though, here's an alias that does the job:
alias findtext 'perl -le '\''-T && print while $_ = shift'\'' *'