What is the difference between "sort -u" and "sort | uniq"?

Everywhere I see someone needing a sorted, unique list, they always pipe to sort | uniq. I've never seen an example where someone uses sort -u instead. Why not? What is the difference, and why is it better to use uniq than the unique flag of sort?

2022-07-14 04:54:20
Answers: 3

One difference is that uniq has a number of useful additional options, such as skipping fields for comparison and counting the number of repetitions of a value. sort's -u flag only implements the functionality of the plain uniq command, with no options.
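
As a rough illustration of those extra options, here is a minimal sketch. The sample data and the helper name `sample` are made up for this example; `-c` and `-f` are the standard uniq options for counting and for skipping leading fields.

```shell
# Hypothetical sample data: four fruit names, one of them repeated.
sample() { printf '%s\n' apple banana apple cherry; }

# Plain deduplication: these two pipelines print the same lines here.
sample | sort -u
sample | sort | uniq

# Counting repetitions needs uniq -c; sort -u has no equivalent.
sample | sort | uniq -c

# Ignoring the first field when comparing needs uniq -f.
# Lines '1 x' and '2 x' differ only in field 1, so only the first is kept.
printf '%s\n' '1 x' '2 x' '3 y' | uniq -f 1
```

The exact padding of the counts printed by uniq -c varies between implementations, so scripts usually parse them with awk rather than by column position.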

2022-07-15 02:04:05

With POSIX compliant sorts and uniqs (GNU uniq is currently not compliant in that regard), there's a difference in that sort uses the locale's collating algorithm to compare strings (will typically use strcoll() to compare strings) while uniq checks for byte-value identity (will typically use strcmp())¹.

That matters for at least two reasons.

  • In some locales, especially on GNU systems, there are different characters that sort the same. For instance, in the en_US.UTF-8 locale on a GNU system, all the ①②③④⑤⑥⑦⑧⑨⑩... characters² and many others sort the same because their sort order is not defined. The Arabic digits 0123456789 sort the same as their Eastern Arabic-Indic counterparts (٠١٢٣٤٥٦٧٨٩).

    For sort -u, ① sorts the same as ② and 0123 the same as ٠١٢٣, so sort -u would retain only one of each. For uniq (not GNU uniq, which uses strcoll() except with -f), ① is different from ② and 0123 is different from ٠١٢٣, so uniq would consider all 4 unique.

  • strcoll can only compare strings of valid characters (the behaviour is undefined as per POSIX when the input has sequences of bytes that don't form valid characters) while strcmp() doesn't care about characters since it only does byte-to-byte comparison. So that's another reason why sort -u may not give you all the unique lines if some of them don't form valid text. sort|uniq, while still unspecified on non-text input, in practice is more likely to give you unique lines for that reason.
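
When byte-for-byte uniqueness is what you want, one way to sidestep the collation subtleties above is to pin the C locale for the commands. This is a minimal sketch rather than something the answer prescribes: the helper name `circled` is made up, the sample characters are borrowed from the example above, and forcing LC_ALL=C also changes the output order to plain byte order.

```shell
# In the C locale, sort compares raw bytes, so characters that a UTF-8
# locale may collate as equal are still treated as distinct lines.
circled() { printf '%s\n' ① ② ① ②; }

# Keeps one line per distinct byte sequence: ① and ②.
circled | LC_ALL=C sort -u

# Same result here, since both steps now compare bytes.
circled | LC_ALL=C sort | LC_ALL=C uniq
```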

Besides those subtleties, one thing that hasn't been noted so far is that uniq compares the whole line lexically, while sort's -u compares based on the sort specification given on the command line.

$ printf '%s\n' 'a b' 'a c' | sort -uk 1,1
a b
$ printf '%s\n' 'a b' 'a c' | sort -k 1,1 | uniq
a b
a c

$ printf '%s\n' 0 -0 +0 00 '' | sort -n | uniq

$ printf '%s\n' 0 -0 +0 00 '' | sort -nu
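
To see the effect of the two numeric pipelines without depending on the order of the output, one can count the lines that survive. This is a sketch only: the helper name `zeros` is made up, and LC_ALL=C is pinned purely to keep uniq's comparison byte-based and reproducible.

```shell
export LC_ALL=C
zeros() { printf '%s\n' 0 -0 +0 00 ''; }

# All five lines compare equal to zero under -n, so -u keeps just one.
zeros | sort -nu | wc -l

# uniq compares the text of the lines; no two are identical, so all five survive.
zeros | sort -n | uniq | wc -l
```

With GNU sort and uniq this prints 1 and then 5.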

¹ Earlier versions of the POSIX spec caused confusion, however, by listing LC_COLLATE as a variable affecting uniq. That was removed in the 2018 edition and the behaviour was clarified following the discussion mentioned above. See the corresponding Austin Group bug.

² 2019 edit. Those have since been fixed, but over 95% of Unicode code points still have an undefined order as of version 2.30 of the GNU libc. You can test with ????? instead for instance in newer versions

2022-07-15 02:01:28

sort | uniq existed before sort -u, and works on a wider range of systems, although almost all modern systems do support -u; it is POSIX. It is mostly a throwback to the days when sort -u didn't exist (and people don't tend to change their methods if the way they know continues to work; just look at ifconfig vs. ip adoption).

The two were most likely merged because removing duplicates from a file requires sorting (at least in the standard case), and it is an extremely common use case for sort. It is also faster internally, because sort can do both operations at the same time and because no IPC (inter-process communication) is needed between uniq and sort. Especially if the file is big, sort -u will likely use fewer intermediate files to sort the data.

On my system I consistently get results like this:

$ dd if=/dev/urandom of=/dev/shm/file bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 8.95208 s, 11.7 MB/s
$ time sort -u /dev/shm/file >/dev/null

real        0m0.500s
user        0m0.767s
sys         0m0.167s
$ time sort /dev/shm/file | uniq >/dev/null

real        0m0.772s
user        0m1.137s
sys         0m0.273s

It also doesn't mask the return code of sort, which may be important (in modern shells there are ways to get it, for example bash's $PIPESTATUS array, but this wasn't always true).
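
The masking can be seen with a deliberately failing sort. This is a minimal sketch that assumes a default shell setup (no pipefail option enabled); the path and variable names are hypothetical and chosen only for illustration.

```shell
# Hypothetical path, assumed not to exist, to make sort fail on purpose.
missing=/nonexistent/input.txt

# Pipeline: $? reports uniq's status, so sort's failure goes unnoticed.
sort "$missing" 2>/dev/null | uniq
pipeline_status=$?

# Single command: $? is sort's own status.
if sort -u "$missing" 2>/dev/null; then single_status=0; else single_status=$?; fi

echo "sort | uniq exited with $pipeline_status; sort -u exited with $single_status"
```

The first status is zero because uniq succeeded on empty input, while the second is non-zero because sort could not open the file.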

2022-07-15 02:00:43