Fastest `uniq` tool in Linux
Let's consider how each solution works.
`uniq`

This requires that the file already be sorted. If not, you have to pipe it through `sort` first, which means that `sort` has to read the entire file into memory, reorder it (O(n log n)), and then write it into the pipe. The work of `uniq` is very cheap, since it only has to compare adjacent lines of its input.
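For reference, a minimal invocation of this approach, assuming an input file named `file.txt` (a placeholder name):

```sh
# Sort first so duplicates become adjacent, then let uniq drop the repeats.
sort file.txt | uniq
```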
`sort -u`

This combines the work of `sort | uniq`. It has to collect all the unique inputs into memory like the `awk` script does, but it also then wastes time sorting them before producing the output. This is O(n log n), although in this case n is the number of unique items, not all the inputs. So it's better than the pipe.
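The equivalent single command, again with the placeholder file name:

```sh
# Sorting and de-duplicating in one process avoids the extra pipe and uniq pass.
sort -u file.txt
```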
`sed`

I'm not sure why you listed this, as I can't think of a good way to do this with `sed` at all. Maybe if you first sort it and pipe to a `sed` script, there's a way to compare adjacent lines. So `sed` would just be doing what `uniq` does, and `uniq` probably does it about as efficiently as possible.
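For what it's worth, an adjacent-line comparison of that sort does exist as a well-known `sed` one-liner. A sketch, assuming the input has already been sorted and using the same placeholder file name:

```sh
# Sketch: emulate uniq on already-sorted input.
# N appends the next line to the pattern space; if the two lines differ,
# P prints the first one; D deletes through the first newline and restarts.
sort file.txt | sed '$!N; /^\(.*\)\n\1$/!P; D'
```

This just re-implements what `uniq` already does, so there's no reason to expect it to be faster.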
`awk`

This is likely the best because it only does the minimal amount of work necessary. As it reads each line, it does an efficient hash lookup to see if the line is already in its memory, storing only the unique lines as hash keys with a counter as the value. (If the line wasn't previously present, the condition will be true, so the line will be printed. Otherwise it won't.) This uses O(n) time and memory proportional to the number of unique lines.
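The script being described is presumably something along the lines of the usual one-liner (the array name `seen` is just a convention):

```sh
# Print a line only the first time it appears: seen[$0]++ evaluates to 0
# (false) on the first occurrence, so !seen[$0]++ is true exactly once per line.
awk '!seen[$0]++' file.txt
```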
Every method will use a considerable amount of memory, either for sorting the input or for keeping track of which inputs have been seen so that duplicates can be removed.
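If you want to check that on your own data, GNU time reports the peak memory of each approach; a rough sketch, using the same placeholder file:

```sh
# "Maximum resident set size" in the verbose output is the peak memory used.
/usr/bin/time -v sort -u file.txt > /dev/null
/usr/bin/time -v awk '!seen[$0]++' file.txt > /dev/null
```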