Efficiently merge / sort / unique large number of text files
A simple fix that works at least in Bash, since printf is a builtin and the command-line argument limits don't apply to it:
printf "%s\0" * | xargs -0 cat | sort -u > /tmp/bla.txt
(echo * | xargs would also work, except for the handling of file names with whitespace etc.)
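To make the whitespace caveat concrete, here is a minimal sketch that compares the two pipelines in a scratch directory. The directory and the file name `a b.txt` are made up for the illustration, and it assumes an xargs that supports -0 (GNU and BSD both do).

```shell
# Scratch directory holding one file whose name contains a space (hypothetical name).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'hello\n' > 'a b.txt'

# NUL-delimited names reach cat intact.
safe=$(printf '%s\0' * | xargs -0 cat)

# Whitespace-delimited names are split into "a" and "b.txt", neither of which exists,
# so cat fails and prints nothing on stdout.
unsafe=$(echo * | xargs cat 2>/dev/null || true)

printf 'safe=[%s] unsafe=[%s]\n' "$safe" "$unsafe"
```

With the NUL-delimited list the file content comes through; with echo the name is split on the space and nothing is read.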
find . -maxdepth 1 -type f ! -name ".*" -exec cat {} + | sort -u -o /path/to/sorted.txt
This will concatenate all non-hidden regular files in the current directory and sort their combined contents (while removing duplicated lines) into the file /path/to/sorted.txt.
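As a quick check of what that find command picks up, the sketch below runs it in a scratch directory containing a hidden file and a subdirectory. All file names here are invented for the example, and the result goes to a temporary file rather than /path/to/sorted.txt.

```shell
# Scratch directory with two visible files, one hidden file and one nested file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'b\na\n' > one.txt
printf 'a\nc\n' > two.txt
printf 'secret\n' > .hidden
mkdir sub && printf 'nested\n' > sub/three.txt

# Output file kept outside the input directory so find cannot list it.
sorted=$(mktemp)
find . -maxdepth 1 -type f ! -name ".*" -exec cat {} + | sort -u -o "$sorted"
cat "$sorted"
```

Only the lines from one.txt and two.txt should appear, with the duplicate `a` collapsed; the hidden file and the file under sub/ are skipped.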
With GNU sort, and a shell where printf is built-in (all POSIX-like ones nowadays except some variants of pdksh):
printf '%s\0' * | sort -u --files0-from=- > output
Now, a problem with that is that, because the two components of that pipeline run concurrently and independently, by the time the left one expands the * glob, the right one may already have created the output file. That could cause problems (maybe not with -u here), as output would be both an input and an output file. So you may want to send the output to another directory (> ../output for instance), or make sure the glob doesn't match the output file.
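One possible way to keep the output out of reach of the glob is sketched here: the result is created with mktemp outside the input directory. The scratch directory and file names are assumptions for the demo, and --files0-from requires GNU sort.

```shell
# Scratch directory standing in for the real input directory (hypothetical files).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'b\na\n' > one.txt
printf 'a\nc\n' > two.txt

# The result file lives outside $workdir, so the * glob can never match it,
# no matter which side of the pipeline starts first.
result=$(mktemp)
printf '%s\0' * | sort -u --files0-from=- > "$result"
cat "$result"
```

The result can then be moved wherever it is needed once sort has finished.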
Another way to address it in this instance is to write it:
printf '%s\0' * | sort -u --files0-from=- -o output
That way, it's sort opening output for writing, and (in my tests) it won't do so before it has received the full list of files (so long after the glob has been expanded). It will also avoid clobbering output if none of the input files are readable.
Another way to write it, with zsh or bash:
sort -u --files0-from=<(printf '%s\0' *) -o output
That's using process substitution (where <(...) is replaced by a file path that refers to the reading end of the pipe printf is writing to). That feature comes from ksh, but ksh insists on making the expansion of <(...) a separate argument to the command, so you can't use it with the --option=<(...) syntax. It would work with this syntax though:
sort -u --files0-from <(printf '%s\0' *) -o output
Note that you'll see a difference from approaches that feed the output of cat on the files in cases where there are files that don't end in a newline character:
$ printf a > a
$ printf b > b
$ printf '%s\0' a b | sort -u --files0-from=-
a
b
$ printf '%s\0' a b | xargs -r0 cat | sort -u
ab
Also note that sort sorts using the collation algorithm of the locale (strcoll()), and sort -u reports one of each set of lines that sort the same by that algorithm, not unique lines at byte level. If you only care about lines being unique at byte level and don't care so much about the order they're sorted in, you may want to fix the locale to C, where sorting is based on byte values (memcmp(); that would probably speed things up significantly):
printf '%s\0' * | LC_ALL=C sort -u --files0-from=- -o output
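A small illustration of what byte-value ordering means in practice, using made-up input on stdin rather than files: in the C locale, uppercase ASCII letters have lower byte values than lowercase ones, so they sort first.

```shell
# Five sample lines, one of them duplicated; LC_ALL=C forces byte-wise comparison.
sorted=$(printf 'b\nA\na\nB\na\n' | LC_ALL=C sort -u)
printf '%s\n' "$sorted"
# Prints A, B, a, b (one per line): uppercase first, duplicate "a" removed.
```

In many other locales the same input would typically come out with upper and lower case interleaved, which is the collation behaviour described above.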