Back to Basics: Sort and Uniq
Learn the fundamentals of sorting and de-duplicating text on the command line.
If you've been using the command line for a long time, it's easy to take the commands you use every day for granted. But, if you're new to the Linux command line, there are several commands that make your life easier that you may not stumble upon automatically. In this article, I cover the basics of two commands that are essential in anyone's arsenal: sort and uniq.
The sort command does exactly what it says: it takes text data as input and outputs sorted data. There are many scenarios on the command line when you may need to sort output, such as the output from a command that doesn't offer sorting options of its own (or whose sort arguments are obscure enough that you just use the sort command instead). In other cases, you may have a text file full of data (perhaps generated with some other script), and you need a quick way to view it in sorted form.
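For example, the cut command doesn't sort its own output, so if you wanted an alphabetized list of the user names on your system, you might combine it with sort (using /etc/passwd here simply as a handy source of unsorted data):
cut -d: -f1 /etc/passwd | sort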
Let's start with a file named "test" that contains three lines:
Foo
Bar
Baz
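If you want to follow along, one quick way to create this file is with printf (any text editor works just as well):
printf 'Foo\nBar\nBaz\n' > test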
sort can operate on input from a pipe, on STDIN redirection, or, in the case of a file, you can simply specify the file on the command line. So, the following three commands all accomplish the same thing:
cat test | sort
sort < test
sort test
And the output that you get from all of these commands is:
Bar
Baz
Foo
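One thing to be aware of is that the exact order sort produces for punctuation and mixed case depends on your locale settings. If you ever see an ordering that surprises you, you can force traditional byte-by-byte sorting by overriding the locale with the standard LC_ALL environment variable:
LC_ALL=C sort test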
Sorting Numerical Output
Now, let's complicate the file by adding three more lines:
Foo
Bar
Baz
1. ZZZ
2. YYY
11. XXX
If you run one of the above sort commands again, this time you'll see different output:
11. XXX
1. ZZZ
2. YYY
Bar
Baz
Foo
This is likely not the output you wanted, but it points out an important fact about sort: by default, it sorts alphabetically, not numerically. This means that a line that starts with "11." is sorted above a line that starts with "1.", and all of the lines that start with numbers are sorted above lines that start with letters. To sort numerically, pass sort the -n option:
sort -n test
Bar
Baz
Foo
1. ZZZ
2. YYY
11. XXX
Find the Largest Directories on a Filesystem
Numerical sorting comes in handy for a lot of command-line output, in particular when your command produces a tally of some kind and you want to see the largest or smallest entries in that tally. For instance, if you want to find out which directories are using the most space under a particular directory, and you want to dig down recursively, you would run a command like this:
du -ckx
This command dives recursively into the current directory and doesn't traverse any other mountpoints inside that directory (the -x option). It tallies the file sizes and then outputs each directory in the order it encountered it, preceded by the size of the files underneath it in kilobytes (the -k option), along with a grand total at the end (the -c option).
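On a small directory tree, the output might look something like this (the directory names and sizes here are invented purely for illustration; note the grand total on the last line, courtesy of the -c option):
88      ./docs
52      ./scripts
1024    ./images
1164    .
1164    total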
Of course, if you're running such a command, it's probably because you want to know which directory is using the most space, and this is where sort comes in:
du -ckx | sort -n
Now you'll get a list of all of the directories underneath the current directory, but this time sorted by file size. If you want to get even fancier, pipe its output to the tail command to see the top ten. On the other hand, if you want the largest directories to be at the top of the output, not the bottom, add the -r option, which tells sort to reverse the order. So, to get the top ten (well, top eight: the first line is the total, and the next line is the size of the current directory itself):
This works, but often people using the du command want to see sizes in more readable output than kilobytes. The du command offers the -h argument for "human-readable" output, so you'll see something like 9.6G instead of the 10024764 you'd get with the -k option. When you pipe that human-readable output to sort, though, you won't get the results you expect by default: because it sorts alphabetically, it places 9.6G above 9.6K, which in turn lands above 9.6M, even though 9.6G is by far the largest of the three.
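You can see the problem with a quick test (the three sizes here are just for illustration):
printf '9.6K\n9.6M\n9.6G\n' | sort
9.6G
9.6K
9.6M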
The sort command has a -h option of its own, and it acts like -n, but it's able to parse standard human-readable sizes and sort them accordingly. So, to see the ten largest directories in your current directory with human-readable output, you would type this:
du -chx | sort -rh | head
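You can confirm that -h does the right thing with the same quick test as before:
printf '9.6K\n9.6M\n9.6G\n' | sort -h
9.6K
9.6M
9.6G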
Removing Duplicates
The sort command isn't limited to sorting one file. You can pipe multiple files into it or list multiple files as arguments on the command line, and it will combine them all and sort them. Unfortunately, if those files contain some of the same information, you'll end up with duplicates in the sorted output.
To remove duplicates, you need the uniq command, which by default removes any duplicate lines that are adjacent to each other from its input and outputs the results. So, let's say you had two files that were different lists of names:
cat namelist1.txt
Jones, Bob
Smith, Mary
Babbage, Walter
cat namelist2.txt
Jones, Bob
Jones, Shawn
Smith, Cathy
You could remove the duplicates by sorting the two files together and piping the result to uniq:
sort namelist1.txt namelist2.txt | uniq
Babbage, Walter
Jones, Bob
Jones, Shawn
Smith, Cathy
Smith, Mary
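As an aside, for this simple case sort can do the whole job by itself: its -u option outputs only the first of each run of equal lines, producing the same five names as the pipeline above:
sort -u namelist1.txt namelist2.txt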
The uniq command has more tricks up its sleeve than this. It also can output only the duplicated lines, so you can find duplicates in a set of files quickly by adding the -d option:
sort namelist1.txt namelist2.txt | uniq -d
Jones, Bob
You even can have uniq provide a tally of how many times it has found each entry with the -c option:
sort namelist1.txt namelist2.txt | uniq -c
1 Babbage, Walter
2 Jones, Bob
1 Jones, Shawn
1 Smith, Cathy
1 Smith, Mary
As you can see, "Jones, Bob" occurred the most times, but if you had a lot of lines, this sort of tally might be less useful, as you'd want the entries with the most duplicates to bubble up to the top. Fortunately, you have the sort command for that:
sort namelist1.txt namelist2.txt | uniq -c | sort -nr
2 Jones, Bob
1 Smith, Mary
1 Smith, Cathy
1 Jones, Shawn
1 Babbage, Walter
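If you find yourself running this sort | uniq -c | sort -rn pipeline often, you could wrap it in a small shell function in your shell configuration (the name tally here is just an illustration, not a standard command):
tally() {
    # Sort all input files (or STDIN), count duplicate lines,
    # then show the most frequent lines first.
    sort "$@" | uniq -c | sort -rn
}
Then tally namelist1.txt namelist2.txt produces the same counted output as above.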
Conclusion
I hope these realistic examples of using sort and uniq show you how powerful these simple command-line tools are. Half the secret with foundational command-line tools like these is to discover (and remember) that they exist, so that they'll be at your command the next time you run into a problem they can solve.