How to remove duplicate lines inside a text file?
An awk solution seen on #bash (Freenode):
awk '!seen[$0]++' filename
There's a simple (which is not to say obvious) method using standard utilities that doesn't require a lot of memory, except to run sort, which in most implementations has specific optimizations for huge files (a good external sort algorithm). An advantage of this method is that it only loops over all the lines inside special-purpose utilities, never inside interpreted languages.
<input nl -b a -s : | # number the lines
sort -t : -k 2 -u | # sort and uniquify ignoring the line numbers
sort -t : -k 1n | # sort according to the line numbers
cut -d : -f 2- >output # remove the line numbers
If all lines begin with a non-whitespace character, you can dispense with some of the options:
<input nl | sort -k 2 -u | sort -k 1n | cut -f 2- >output
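As a sanity check, here is a hypothetical run of the simplified pipeline on made-up input (with GNU sort, whose -u keeps the first of each run of equal keys, so the first occurrence of every line survives and the original order is preserved):
printf 'pear\napple\npear\nbanana\napple\n' |
nl | sort -k 2 -u | sort -k 1n | cut -f 2-
# pear
# apple
# banana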
For a large amount of duplication, a method that only requires storing a single copy of each line in memory will perform better. With some interpretation overhead, there's a very concise awk script for that (already posted by enzotib):
<input awk '!seen[$0]++'
Less concisely: !seen[$0] {print} {seen[$0] += 1}, i.e. print the current line if it hasn't been seen yet, then increment the seen counter for this line (uninitialized variables or array elements have the numerical value 0).
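For instance, this made-up run shows that the one-liner keeps the first copy of each line and preserves the original order:
printf 'pear\napple\npear\nbanana\napple\n' | awk '!seen[$0]++'
# pear
# apple
# banana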
For long lines, you can save memory by keeping only a non-spoofable checksum (e.g. a cryptographic digest) of each line. For example, with MD5 you only need 16 bytes per line (20 with SHA-1), plus a constant overhead. But computing digests is rather slow; this method only wins if you have a fast CPU (especially one with a hardware accelerator for the digests), not a lot of memory relative to the size of the file, and sufficiently long lines. No basic utility lets you compute a checksum for each line; you'd have to bear the interpretation overhead of Perl/Python/Ruby/… or write a dedicated compiled program.
<input perl -MDigest::MD5 -ne '$seen{Digest::MD5::md5($_)}++ or print' >output
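The same approach works with other digests; for instance, a sketch with SHA-1, assuming the core Digest::SHA module (shipped with Perl since 5.10):
<input perl -MDigest::SHA -ne '$seen{Digest::SHA::sha1($_)}++ or print' >output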
If you don't need to preserve the input order, sort -u does the whole job:
sort -u big-csv-file.csv > duplicates-removed.csv
Note that the output file will be sorted.
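For example, duplicates are removed, but the lines come out in sorted order rather than input order:
printf 'pear\napple\npear\n' | sort -u
# apple
# pear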