Uniq Program (The GNU Awk User’s Guide)

11.2.6 Printing Nonduplicated Lines of Text

The uniq utility reads sorted lines of data on its standard input, and by default removes duplicate lines. It compares only adjacent lines, which is why the input should be sorted. In other words, it prints only unique lines, hence the name. uniq has a number of options. The usage is as follows:

uniq [-udc [-f n] [-s n]] [inputfile [outputfile]]

The options for uniq are:

-d: Print only repeated (duplicated) lines.
-u: Print only nonrepeated (unique) lines.
-c: Count lines. This option overrides -d and -u. Both repeated and nonrepeated lines are counted.
-f n: Skip n fields before comparing lines. The definition of fields is similar to awk’s default: nonwhitespace characters separated by runs of spaces and/or TABs.
-s n: Skip n characters before comparing lines. Any fields specified with -f are skipped first.
inputfile: Data is read from the input file named on the command line, instead of from the standard input.
outputfile: The generated output is sent to the named output file, instead of to the standard output.

With no options, uniq prints one copy of every line, whether repeated or not. The program implements this default as if both the -d and -u options had been provided.
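As a quick illustration of these options, the following shell commands run the system's own uniq over a small sample. This is only a sketch: it assumes a POSIX-conforming uniq is on the PATH, and the sample data and the sample function are made up for the example. The column width of the -c output varies between implementations, so only the counts are noted:

```shell
# Sample input invented for this sketch: already sorted, with repeats.
sample() { printf 'a\na\nb\nc\nc\nc\n'; }

sample | uniq        # one copy of every line: a, b, c
sample | uniq -d     # only lines that were repeated: a, c
sample | uniq -u     # only lines that were not repeated: b
sample | uniq -c     # each line preceded by its count: 2, 1, 3

# -f 1 skips the first field, so only the second field is compared.
printf '1 x\n2 x\n3 y\n' | uniq -f 1    # prints "1 x" and "3 y"
```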

uniq uses the getopt() library function (see Processing Command-Line Options) and the join() library function (see Merging an Array into a String).

The program begins with a usage() function and then a brief outline of the options and their meanings in comments:

# uniq.awk --- do uniq in awk
#
# Requires getopt() and join() library functions


function usage()
{
    print("Usage: uniq [-udc [-f fields] [-s chars]] " \
          "[ in [ out ]]") > "/dev/stderr"
    exit 1
}

# -c    count lines. overrides -d and -u
# -d    only repeated lines
# -u    only nonrepeated lines
# -f n  skip n fields
# -s n  skip n characters, skip fields first

The POSIX standard for uniq allows options to start with ‘+’ as well as with ‘-’. An initial BEGIN rule traverses the arguments changing any leading ‘+’ to ‘-’ so that the getopt() function can parse the options:

# As of 2020, '+' can be used as the option character in addition to '-'
# Previously allowed use of -N to skip fields and +N to skip
# characters is no longer allowed, and not supported by this version.

BEGIN {
    # Convert + to - so getopt can handle things
    for (i = 1; i < ARGC; i++) {
        first = substr(ARGV[i], 1, 1)
        if (ARGV[i] == "--" || (first != "-" && first != "+"))
            break
        else if (first == "+")
            # Replace "+" with "-"
            ARGV[i] = "-" substr(ARGV[i], 2)
    }
}
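To see what this rule does to the argument list, the same loop can be run on its own. The following is a minimal sketch, not part of uniq.awk: the shell function name plus_to_minus and the sample arguments are invented for illustration, and the second loop exists only to print the rewritten ARGV. Note that the rewrite stops at the first argument that does not begin with '-' or '+':

```shell
# plus_to_minus is a name made up for this sketch; it prints the
# arguments as they look after the '+' to '-' rewrite.
plus_to_minus() {
    awk 'BEGIN {
        for (i = 1; i < ARGC; i++) {
            first = substr(ARGV[i], 1, 1)
            if (ARGV[i] == "--" || (first != "-" && first != "+"))
                break    # first nonoption argument: stop rewriting
            else if (first == "+")
                ARGV[i] = "-" substr(ARGV[i], 2)
        }
        # Print the resulting argument list on one line
        for (i = 1; i < ARGC; i++)
            printf("%s%s", ARGV[i], (i < ARGC - 1 ? " " : "\n"))
    }' "$@"
}

plus_to_minus +c +f2 data.txt +s3
# prints: -c -f2 data.txt +s3
```

Because the program consists only of a BEGIN rule, awk never tries to open data.txt; the arguments are inspected purely as strings.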

The next BEGIN rule deals with the command-line arguments and options. First, outputfile is initialized to the standard output, /dev/stdout. If neither -d nor -u is supplied, then the default is taken, to print both repeated and nonrepeated lines. If two file names remain after the options, the second one is assigned to outputfile and removed from ARGV, so that awk does not treat it as an input file:

BEGIN {
    count = 1
    outputfile = "/dev/stdout"
    opts = "udcf:s:"
    while ((c = getopt(ARGC, ARGV, opts)) != -1) {
        if (c == "u")
            non_repeated_only++
        else if (c == "d")
            repeated_only++
        else if (c == "c")
            do_count++
        else if (c == "f")
            fcount = Optarg + 0
        else if (c == "s")
            charcount = Optarg + 0
        else
            usage()
    }

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    if (repeated_only == 0 && non_repeated_only == 0)
        repeated_only = non_repeated_only = 1

    if (ARGC - Optind == 2) {
        outputfile = ARGV[ARGC - 1]
        ARGV[ARGC - 1] = ""
    }
}

The following function, are_equal(), compares the current line, $0, to the previous line, last. It handles skipping fields and characters. If no field count and no character count are specified, are_equal() returns one or zero depending upon the result of a simple string comparison of last and $0.

Otherwise, things get more complicated. If fields have to be skipped, each line is broken into an array using split() (see String-Manipulation Functions); the desired fields are then joined back into a line using join(). The joined lines are stored in clast and cline. If no fields are skipped, clast and cline are set to last and $0, respectively. Finally, if characters are skipped, substr() is used to strip off the leading charcount characters in clast and cline. The two strings are then compared and are_equal() returns the result:

function are_equal(    n, m, clast, cline, alast, aline)
{
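    if (fcount == 0 && charcount == 0)
        # Concatenating "" forces a string comparison, even when
        # both lines look like numbers (e.g., "1" and "1.0")
        return ((last "") == ($0 ""))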


    if (fcount > 0) {
        n = split(last, alast)
        m = split($0, aline)
        clast = join(alast, fcount+1, n)
        cline = join(aline, fcount+1, m)
    } else {
        clast = last
        cline = $0
    }
    if (charcount) {
        clast = substr(clast, charcount + 1)
        cline = substr(cline, charcount + 1)
    }


    return (clast == cline)
}
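The effect of skipping fields and characters can be tried in isolation. The following is a minimal sketch under stated assumptions, not the code used by uniq.awk: same_key and key() are hypothetical names, the sample lines are made up, and key() uses a small loop in place of the join() library function so that the example is self-contained:

```shell
# same_key is a hypothetical wrapper for this sketch:
#   same_key FIELDS CHARS LINE1 LINE2   prints 1 if the lines match, else 0
same_key() {
    awk -v fcount="$1" -v charcount="$2" -v a="$3" -v b="$4" '
    # key() is a stand-in for the split()/join()/substr() steps:
    # drop the first nf fields, then the first nc characters.
    function key(s, nf, nc,    parts, n, i, out) {
        if (nf > 0) {
            n = split(s, parts)
            out = ""
            for (i = nf + 1; i <= n; i++)
                out = out (i > nf + 1 ? " " : "") parts[i]
            s = out
        }
        if (nc > 0)
            s = substr(s, nc + 1)
        return s ""    # concatenation forces a string result
    }
    BEGIN { print (key(a, fcount, charcount) == key(b, fcount, charcount)) }'
}

same_key 0 0 "10:02 login alice" "10:07 login alice"    # prints 0
same_key 1 0 "10:02 login alice" "10:07 login alice"    # prints 1
same_key 1 2 "id7 xxabc" "id9 yyabc"                     # prints 1
```

In the last call, one field is skipped first ("xxabc" and "yyabc" remain) and then two characters, leaving "abc" on both sides.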

The following two rules are the body of the program. The first one is executed only for the very first line of data. It sets last equal to $0, so that subsequent lines of text have something to be compared to.

The second rule does the work. The variable equal is one or zero, depending upon the results of are_equal()’s comparison. If uniq is counting lines (the -c option), and the lines are equal, then it increments the count variable. Otherwise, it prints the count and the previous line, saves the current line in last, and resets count, because the two lines are not equal.

If uniq is not counting, and if the lines are equal, count is incremented. Nothing is printed, as the point is to remove duplicates. Otherwise, the lines differ: if uniq is printing repeated lines and more than one line was seen, or if uniq is printing nonrepeated lines and only one line was seen, then the previous line is printed. In either case, the current line is saved in last and count is reset.

Finally, similar logic is used in the END rule to print the final line of input data:

NR == 1 {
    last = $0
    next
}

{
    equal = are_equal()

    if (do_count) {    # overrides -d and -u
        if (equal)
            count++
        else {
            printf("%4d %s\n", count, last) > outputfile
            last = $0
            count = 1    # reset
        }
        next
    }

    if (equal)
        count++
    else {
        if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
                print last > outputfile
        last = $0
        count = 1
    }
}

END {
    if (NR > 0) {    # print nothing if no input was read
        if (do_count)
            printf("%4d %s\n", count, last) > outputfile
        else if ((repeated_only && count > 1) ||
                (non_repeated_only && count == 1))
            print last > outputfile
    }
    close(outputfile)
}
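The run-tracking logic in these rules can be exercised without the option-parsing code. The following is a minimal sketch, not the program itself: the shell function runs, its mode words (all, dups, uniq, count), and the emit() helper are invented for this example and stand in for the default, -d, -u, and -c behaviors. Lines are compared whole, with no field or character skipping, and nothing is printed when no input was read:

```shell
# runs is a hypothetical wrapper; $1 is a mode word chosen for this
# sketch: all (default), dups (-d), uniq (-u), or count (-c).
runs() {
    awk -v mode="$1" '
    # Print the finished run of identical lines according to the mode
    function emit() {
        if (mode == "count")
            printf("%4d %s\n", count, last)
        else if ((mode != "uniq" && count > 1) ||
                 (mode != "dups" && count == 1))
            print last
    }
    NR == 1 { last = $0; count = 1; next }
    ($0 "") == (last "") { count++; next }    # same run: just count it
    { emit(); last = $0; count = 1 }          # new run: flush the old one
    END { if (NR > 0) emit() }                # flush the final run
    '
}

printf 'a\na\nb\nc\nc\nc\n' | runs dups     # prints a and c
printf 'a\na\nb\nc\nc\nc\n' | runs uniq     # prints b
printf 'a\na\nb\nc\nc\nc\n' | runs count    # prints counts 2, 1, 3
```

Folding the printing decision into one helper avoids repeating the same condition in the main rule and in the END rule.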

As a side note, this program does not follow our recommended convention of naming global variables with a leading capital letter. Doing that would make the program a little easier to follow.