What is the difference between "sort -u" and "sort | uniq"?
`sort | uniq` existed before `sort -u`, and is compatible with a wider range of systems, although almost all modern systems do support `-u` (it's POSIX). It's mostly a throwback to the days when `sort -u` didn't exist (and people don't tend to change their methods if the way that they know continues to work; just look at `ifconfig` vs. `ip` adoption).
The two were likely merged because removing duplicates within a file requires sorting (at least, in the standard case), and is an extremely common use case of `sort`. It is also faster internally as a result of being able to do both operations at the same time (and because it doesn't require IPC between `uniq` and `sort`). Especially if the file is big, `sort -u` will likely use fewer intermediate files to sort the data.
On my system I consistently get results like this:
$ dd if=/dev/urandom of=/dev/shm/file bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 8.95208 s, 11.7 MB/s
$ time sort -u /dev/shm/file >/dev/null
real 0m0.500s
user 0m0.767s
sys 0m0.167s
$ time sort /dev/shm/file | uniq >/dev/null
real 0m0.772s
user 0m1.137s
sys 0m0.273s
`sort -u` also doesn't mask the return code of `sort`, which may be important (in modern shells there are ways to get this, for example, `bash`'s `$PIPESTATUS` array, but this wasn't always true).
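The masking can be seen with a deliberately failing `sort`. This is a minimal sketch: the path `/nonexistent/input.txt` is a hypothetical file assumed not to exist, and the exact non-zero status that `sort` returns is implementation-specific, so none is claimed here.

```shell
# Hypothetical path, assumed not to exist, so that sort fails.
missing=/nonexistent/input.txt

# Pipeline: $? is the status of uniq (the last command), so sort's failure is hidden.
sort "$missing" 2>/dev/null | uniq
pipeline_status=$?

# Single command: $? is sort's own exit status.
sort -u "$missing" 2>/dev/null
sort_u_status=$?

echo "sort | uniq: $pipeline_status, sort -u: $sort_u_status"

# bash only: PIPESTATUS keeps the status of every pipeline member.
if command -v bash >/dev/null 2>&1; then
    bash -c 'sort "$1" 2>/dev/null | uniq; echo "sort inside pipeline: ${PIPESTATUS[0]}"' _ "$missing"
fi
```

The pipeline reports success because `uniq` happily processes empty input, while `sort -u` reports the failure directly.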
One difference is that `uniq` has a number of useful additional options, such as skipping fields for comparison and counting the number of repetitions of a value. `sort`'s `-u` flag only implements the functionality of the unadorned `uniq` command.
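As a small illustration of those extra options, the following sketch uses made-up input to show `uniq -c` (count repetitions), `uniq -d` (print only repeated lines) and `uniq -f 1` (skip the first field when comparing). `LC_ALL=C` is set only to make the sort order predictable.

```shell
# Count how many times each line occurs (the count is printed before the line).
printf '%s\n' b a b b | LC_ALL=C sort | uniq -c

# Print only the lines that occur more than once.
dups=$(printf '%s\n' b a b b | LC_ALL=C sort | uniq -d)
echo "duplicated: $dups"

# Ignore the first field when comparing adjacent lines:
# '1 x' and '2 x' are treated as duplicates, '3 y' is not.
by_field=$(printf '%s\n' '1 x' '2 x' '3 y' | uniq -f 1)
echo "$by_field"
```

The counting and duplicates-only modes have no counterpart in `sort -u`, which is why the pipeline form is still needed for them.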
With POSIX-compliant `sort`s and `uniq`s (GNU `uniq` is currently not compliant in that regard), there's a difference in that `sort` uses the locale's collating algorithm to compare strings (it will typically use `strcoll()`) while `uniq` checks for byte-value identity (it will typically use `strcmp()`)¹.
That matters for at least two reasons.
In some locales, especially on GNU systems, there are different characters that sort the same. For instance, in the en_US.UTF-8 locale on a GNU system, all the ①②③④⑤⑥⑦⑧⑨⑩... characters² and many others sort the same because their sort order is not defined. The 0123456789 Arabic digits sort the same as their Eastern Arabic Indic counterparts (٠١٢٣٤٥٦٧٨٩). For `sort -u`, ① sorts the same as ② and 0123 the same as ٠١٢٣, so `sort -u` would retain only one of each, while for `uniq` (not GNU `uniq`, which uses `strcoll()` (except with `-i`)), ① is different from ② and 0123 different from ٠١٢٣, so `uniq` would consider all 4 unique.
`strcoll()` can only compare strings of valid characters (the behaviour is undefined as per POSIX when the input has sequences of bytes that don't form valid characters) while `strcmp()` doesn't care about characters since it only does byte-to-byte comparison. So that's another reason why `sort -u` may not give you all the unique lines if some of them don't form valid text. `sort | uniq`, while still unspecified on non-text input, in practice is more likely to give you unique lines for that reason.
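The collation effect can be probed with a short experiment. This is only a sketch: it assumes a UTF-8 terminal and that the en_US.UTF-8 locale is installed, and the counts in that locale depend on the libc version (see footnote ²), so no output is claimed for it. In the C locale, comparison is by byte value, so both lines always survive.

```shell
# Two distinct characters whose collation order may be undefined in some locales.
# Assumes the en_US.UTF-8 locale is installed on the system.
printf '%s\n' ① ② | LC_ALL=en_US.UTF-8 sort -u | wc -l

# Same input through sort | uniq, for comparison on your own system.
printf '%s\n' ① ② | LC_ALL=en_US.UTF-8 sort | LC_ALL=en_US.UTF-8 uniq | wc -l

# In the C locale, lines are compared byte by byte, so both lines are kept.
c_count=$(printf '%s\n' ① ② | LC_ALL=C sort -u | wc -l | tr -d ' ')
echo "C locale: $c_count unique lines"
```

If the first two counts differ from each other, or from the C locale count, the locale's collation is treating the two characters as equal somewhere along the way.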
Besides those subtleties, one thing that hasn't been noted so far is that `uniq` compares whole lines lexically, while `sort`'s `-u` compares based on the sort specification given on the command line.
$ printf '%s\n' 'a b' 'a c' | sort -uk 1,1
a b
$ printf '%s\n' 'a b' 'a c' | sort -k 1,1 | uniq
a b
a c
$ printf '%s\n' 0 -0 +0 00 '' | sort -n | uniq
0
-0
+0
00
$ printf '%s\n' 0 -0 +0 00 '' | sort -nu
0
¹ Prior versions of the POSIX spec were causing confusion, however, by listing the `LC_COLLATE` variable as one affecting `uniq`. That was removed in the 2018 edition and the behaviour clarified following that discussion mentioned above. See the corresponding Austin group bug.
² 2019 edit. Those have since been fixed, but over 95% of Unicode code points still have an undefined order as of version 2.30 of the GNU libc. You can test with instead for instance in newer versions