How to sort big files?
GNU sort
(which is the default on most Linux systems) has a --parallel
option. From http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html:
‘--parallel=n’
Set the number of sorts run in parallel to n. By default, n is set to the number of available processors, but limited to 8, as there are diminishing performance gains after that. Note also that using n threads increases the memory usage by a factor of log n. Also see nproc invocation.
Since your CPU has 2 cores, you could do:
sort --parallel=2 -uo list-sorted.txt list.txt
It is better to specify the actual number of physical cores, since hyper-threading can make more processors appear available than there are cores.
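If you don't want to hard-code the count, you can let the machine report it; a sketch, assuming GNU coreutils' nproc (which counts hyper-threads) and util-linux's lscpu (for physical cores only):

```shell
# Use all available processing units (nproc counts hyper-threads too):
sort --parallel="$(nproc)" -uo list-sorted.txt list.txt

# Or count only physical cores, ignoring hyper-threading siblings:
cores=$(lscpu -p=CORE | grep -v '^#' | sort -u | wc -l)
sort --parallel="$cores" -uo list-sorted.txt list.txt
```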
You could also experiment with nice
to influence the processor scheduling priority and ionice
to influence I/O scheduling. You can increase the priority over other processes this way, though I don't think it will give you large savings, as these tools are usually better suited to making sure a background process doesn't use too many resources. Nevertheless, you can combine them with something like:
nice -n -20 ionice -c2 -n7 sort --parallel=2 -uo list-sorted.txt list.txt
Note also that, as Gilles commented, using a single GNU sort command will be faster than any other method of breaking down the sorting, as the algorithm is already optimised to handle large files. Anything else will likely just slow things down.
Using the sort
command will probably be the fastest option.
But you'll probably want to fix the locale to C.
sort -u
doesn't report unique lines, but one line out of each set of lines that sort the same. In the C locale, two different lines necessarily sort differently, but that's not the case in most UTF-8 based locales on GNU systems.
Also, using the C locale avoids the overhead of having to parse UTF-8 and process complex sort orders, so it would improve performance dramatically.
So:
LC_ALL=C sort -u file
You can also improve performance by using a faster drive (or a drive different from the one where the input and/or output files are) for the temporary files (using the -T
option or the $TMPDIR
environment variable), or by fiddling with the -S
option supported by some sort
implementations.
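As a sketch (the /mnt/ssd/tmp path and the 1G buffer size are assumptions to adapt to your machine):

```shell
# Keep temporary files on a separate, faster drive and give sort a
# larger in-memory buffer so it spills to disk less often:
LC_ALL=C sort -T /mnt/ssd/tmp -S 1G -u -o list-sorted.txt list.txt

# Equivalently, TMPDIR is honoured when -T isn't given:
TMPDIR=/mnt/ssd/tmp LC_ALL=C sort -S 1G -u -o list-sorted.txt list.txt
```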
For some types of input, or for slow storage, using the --compress-program
option of GNU sort
(for instance with lzop
) might improve performance in addition to reducing storage usage.
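A sketch: the compressor is only invoked once sort starts spilling temporary files, and any program that compresses stdin to stdout and decompresses when called with -d qualifies (lzop must be installed; gzip works as a fallback):

```shell
# Compress sort's temporary files with lzop (fast, light compression):
LC_ALL=C sort --compress-program=lzop -u -o list-sorted.txt list.txt

# gzip as a fallback if lzop isn't installed:
LC_ALL=C sort --compress-program=gzip -u -o list-sorted.txt list.txt
```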
Now just a note for those objecting (rightly to some extent) that it will not be the correct order:
I agree that as a human, I'd like to see Stéphane sort in between Stefan and Stephanie, but:
- A computer would want Stéphane to sort after, since
é
(at least when expressed as U+00E9) sorts after as a character, as do the bytes of its UTF-8 encoding (in terms of codepoint or byte value). That's a sort order that is very simple to implement, is a strict total order, and holds no surprises.
- Your locale's sort order will likely not be satisfactory in many cases either, even to a human. For example, on my system with the default en_GB.utf8 locale:
Stéphane and Stéphane (one with U+00E9, the other with e followed by U+0301) don't sort the same:

$ printf '%b\n' 'Ste\u0301phane' 'St\u00e9phane' | sort -u
Stéphane
Stéphane
but ③, ①, ② all sort the same (obviously a bug in those locale definitions¹):

$ printf '%s\n' ③ ① ② | sort -u
③

Here, it's ③, but it could just as well have been ① or ②.
So IMO, chances are you always want
sort -u
with LC_ALL=C, if you want unique lines. And if you want that resulting list to be sorted in the user's sort order, pipe it to sort
again:

LC_ALL=C sort -u | sort

Or, if you also want a count of occurrences of each line:

LC_ALL=C sort | LC_ALL=C uniq -c | sort -k2
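For instance, on a small made-up input, the counting variant yields one count per distinct line, ordered by the line itself:

```shell
# Count occurrences of each line, then order by the line (field 2):
printf '%s\n' b a b | LC_ALL=C sort | LC_ALL=C uniq -c | sort -k2
# The output shows a count of 1 for "a" and 2 for "b", with "a" first.
```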
¹ 2019 edit: the order of ① ② ③ ④ ⑤... has since been fixed in newer versions of the GNU libc, but as of 2.30, over 95% of characters still don't have a defined order; you can replace ① ② ③ ④ ⑤ with … for instance. Hopefully, GNU locales will eventually be fixed completely (they will have to if they want to comply with the next revision of the standard), and the problem will then be limited to user-defined locales.