Remove duplicate lines while keeping the order of the lines
I doubt it will make a difference but, just in case, here's how to do the same thing in Perl:
perl -ne 'print if ++$k{$_}==1' out.txt
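As a quick check on a toy input (the sample data here is mine, not from the question), it keeps the first occurrence of each line, just like the awk idiom:
$ printf 'bb\naa\nbb\naa\n' | perl -ne 'print if ++$k{$_}==1'
bb
aa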
If the problem is keeping the unique lines in memory, that will have the same issue as the awk
you tried. So, another approach could be:
cat -n out.txt | sort -k2 -k1n | uniq -f1 | sort -nk1,1 | cut -f2-
How it works:
- On a GNU system, cat -n will prepend the line number to each line, after some amount of spaces and followed by a <tab> character. cat pipes this input representation to sort.
- sort's -k2 option instructs it to consider only the characters from the second field up to the end of the line when sorting, and sort splits fields on whitespace by default (here, cat's inserted spaces and the <tab>). When followed by -k1n, sort considers the 2nd field first and then, for identical -k2 fields, considers the 1st field, sorted numerically. So repeated lines are sorted together, in the order they appeared.
- The results are piped to uniq, which is told to ignore the first field (-f1, again whitespace-separated). This leaves a list of the unique lines of the original file, which is piped back to sort.
- This time sort sorts numerically on the first field (cat's inserted line number), restoring the original order of the file, and pipes the results to cut.
- Lastly, cut removes the line numbers that cat inserted, by printing only from the 2nd field through the end of the line (cut's default delimiter is the <tab> character).
To illustrate:
$ cat file
bb
aa
bb
dd
cc
dd
aa
bb
cc
$ cat -n file | sort -k2 -k1n | uniq -f1 | sort -nk1,1 | cut -f2-
bb
aa
dd
cc
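To make each stage visible, here are the intermediate results for the same sample file (the exact spacing printed by cat -n may differ):
$ cat -n file | sort -k2 -k1n             # duplicates grouped, original order kept inside each group
     2  aa
     7  aa
     1  bb
     3  bb
     8  bb
     5  cc
     9  cc
     4  dd
     6  dd
$ cat -n file | sort -k2 -k1n | uniq -f1  # only the first occurrence of each group survives
     2  aa
     1  bb
     5  cc
     4  dd
$ cat -n file | sort -k2 -k1n | uniq -f1 | sort -nk1,1   # back to the original line order
     1  bb
     2  aa
     4  dd
     5  cc
The final cut -f2- then strips the numbers off, giving the output shown above.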
Another option is Perl's DB_File module, which ties the lookup hash to a Berkeley DB database (this is Sol2 in the comparison below):
#!/usr/bin/perl
use DB_File;
# tie the "seen" hash to a Berkeley DB database instead of a plain Perl hash
# (no filename given here; the cached variant below uses an explicit file)
tie %h, 'DB_File';
# print each line the first time it is seen, then mark it as seen
while(<>){ not $h{$_} and print and $h{$_}=1 }
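Saved as a script, it is used like the one-liner; dbfile-uniq is simply the name used in the benchmark below, and the output file name is arbitrary:
$ perl dbfile-uniq out.txt > out.uniq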
EDIT 1: Does it really work? (comparing)
Sol1 : Terdon et al. Schwartzian-transform-like one-liner
cat -n _1 | sort -uk2 | sort -nk1 | cut -f2-
Sol2 : perl + DB_File (this answer)
perl dbfile-uniq _1
Sol3 : OP's awk (John W. Gill's solution behaves similarly)
awk '!seen[$0]++' _1
Sol4: Terdon perl
perl -ne 'print if ++$k{$_}==1' _1
Case 1: 100_000_000 random numbers (5 digits each), 566 Mbytes, 31_212 different values:
$ while true ; do echo $RANDOM; done | head -100000000 > _1
Case 2: 50_000_000 random numbers (10 digits each), 516 Mbytes, 48_351_464 different values:
$ shuf _1 | sed 'N;s/\n/ /' > _11
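The answer does not show how the runs were timed; a plausible harness (my assumption) is simply bash's time keyword, which times a whole pipeline:
$ time awk '!seen[$0]++' _1 > /dev/null                          # Sol3
$ time perl -ne 'print if ++$k{$_}==1' _1 > /dev/null            # Sol4
$ time cat -n _1 | sort -uk2 | sort -nk1 | cut -f2- > /dev/null  # Sol1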
(the following timings are not very precise):
┌────────┬────────┬────────────────┬────────┬──────┐
│        │ Sol1   │ Sol2           │ Sol3   │ Sol4 │
│        │ sort...│ perl DB        │ awk    │ perl │
├────────┼────────┼────────────────┼────────┼──────┤
│ case 1 │ 6m15   │ 6m17           │ 0m28   │ 0m28 │
├────────┼────────┼────────────────┼────────┴──────┤
│ case 2 │ 11m15  │ 81m44          │ out of memory │
├────────┼────────┼────────────────┼────────┬──────┤
│ case 2 │        │ 5m54 /cache=2G │        │      │
└────────┴────────┴────────────────┴────────┴──────┘
Sol2 with a cache is:
use DB_File;
use Fcntl;
# ask Berkeley DB for a ~2 GB cache (cachesize is given in bytes)
$DB_HASH->{'cachesize'} = 2000_000_000;
# tie the hash to an on-disk database file, created/truncated as needed
tie %h, 'DB_File', "_my.db", O_RDWR|O_CREAT|O_TRUNC, 0640, $DB_HASH;
# print each line the first time it is seen, then mark it as seen
while(<>){ not $h{$_} and print and $h{$_}=1 }
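A hypothetical invocation of the cached variant (the script and output names here are mine; _my.db is the database file the script creates and can be deleted afterwards):
$ perl dbfile-uniq-cached _11 > _11.uniq
$ rm -f _my.db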
sort could also be optimized by increasing its buffer size (--buffer-size), but that was not done here.
One quick conclusion: sort is a fantastic command!