Delete duplicate lines pairwise?
I worked out the sed answer not long after I posted this question; no one else has used sed so far, so here it is:
sed '$!N;/^\(.*\)\n\1$/d;P;D'
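As a quick sanity check, here is a minimal sketch that wraps the command in a function and feeds it made-up input. The name drop_pairs and the sample lines are only illustrative. The pair of "a" lines is removed, "b" is kept, and of the three "c" lines one survives, so the output should be "b" followed by "c".

```sh
# drop_pairs is a hypothetical wrapper around the one-liner above.
# It reads the named files, or standard input when none are given.
drop_pairs() {
    sed '$!N;/^\(.*\)\n\1$/d;P;D' "$@"
}

# Made-up input: "a" twice, "b" once, "c" three times.
printf '%s\n' a a b c c c | drop_pairs
```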
A little playing around with the more general problem (what about deleting lines in sets of three? Or four, or five?) provided the following extensible solution:
sed -e ':top' -e '$!{/\n/!{N;b top' -e '};};/^\(.*\)\n\1$/d;P;D' temp
Extended to remove triples of lines:
sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' temp
Or to remove quads of lines:
sed -e ':top' -e '$!{/\n.*\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1\n\1$/d;P;D' temp
sed has an additional advantage over most other options: it truly operates on a stream, holding no more lines in memory than the number of lines being checked for duplicates.
As cuonglm pointed out in the comments, setting the locale to C is necessary; otherwise lines containing multi-byte characters may not be removed properly. So the commands above become:
LC_ALL=C sed '$!N;/^\(.*\)\n\1$/d;P;D' temp
LC_ALL=C sed -e ':top' -e '$!{/\n/!{N;b top' -e '};};/^\(.*\)\n\1$/d;P;D' temp
LC_ALL=C sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' temp
# Etc.
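Since the commands above differ only in how many times \n and \1 are repeated, the script can be built for any group size. The following is a rough sketch of that idea, not a polished tool: drop_groups and its variable names are made up for illustration, and it assumes N is at least 2 and a sed that accepts the multiple -e form used above.

```sh
# drop_groups is a hypothetical helper, not part of sed.
# Usage: drop_groups N [file...]   (N >= 2 is assumed, not checked)
drop_groups() {
    n=$1
    shift
    need='\n'          # test for "pattern space already holds N lines"
    match='^\(.*\)'    # first line of the group, captured as \1
    i=1
    while [ "$i" -lt "$n" ]; do
        match="$match"'\n\1'        # one more copy of the captured line
        if [ "$i" -gt 1 ]; then
            need="$need"'.*\n'      # one more newline required before testing
        fi
        i=$((i + 1))
    done
    LC_ALL=C sed -e ':top' -e '$!{/'"$need"'/!{N;b top' -e '};};/'"$match"'$/d;P;D' "$@"
}

# Builds the same script as the triples command above; reads standard input.
printf '%s\n' a a a b b c | drop_groups 3
```

With that made-up input the three "a" lines are dropped and "b", "b", "c" are printed, because two "b" lines do not make a full triple.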
It's not very elegant, but it's as simple as I can come up with:
uniq -c input | awk '{if ($1 % 2 == 1) { print substr($0, 9) }}'
The substr() just trims off the count that uniq -c prepends. That'll work until you have more than 9,999,999 duplicates of a line, in which case the count may no longer fit in the 8-character prefix that substr($0, 9) skips.
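If the fixed width is a concern, one possible variation is to strip the count with a regular expression instead of a column offset. This is only a sketch: odd_runs is a made-up name, and it assumes uniq -c writes optional leading spaces, the count, and a single space before the line, which is an assumption about the uniq implementation in use.

```sh
# odd_runs is a hypothetical helper: print one copy of each line whose
# run of adjacent duplicates has an odd length.
odd_runs() {
    uniq -c "$@" |
        awk '$1 % 2 == 1 {
            # Remove leading blanks, the count, and exactly one separator
            # space, so spaces belonging to the line itself are kept.
            sub(/^ *[0-9]+ /, "")
            print
        }'
}

# Made-up input: "a" twice, "b" once, "c" three times.
printf '%s\n' a a b c c c | odd_runs
```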
Give this awk script a try:
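#!/usr/bin/awk -f
# Print each line whose run of identical adjacent lines has an odd length.
{
    # The line changed: emit the previous line if it was seen an odd
    # number of times. count is only reset here; after an even run it
    # keeps growing, which leaves its parity unchanged.
    if ((NR != 1) && (previous != $0) && (count % 2 == 1)) {
        print previous;
        count = 0;
    }
    previous = $0;
    count++;
}
END {
    # Handle the final run of lines.
    if (count % 2 == 1) {
        print previous;
    }
}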
It is assumed that the lines.txt file is sorted, so that duplicate lines are adjacent.
The test:
$ chmod +x script.awk
$ ./script.awk lines.txt
a
d
e
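The contents of lines.txt are not shown here, so the following is a guess at an input that would give the same result, not the original file. It is a minimal sketch that runs the same logic inline; odd_count_lines and the sample lines are made up for illustration.

```sh
# odd_count_lines is a hypothetical wrapper around the same awk logic.
odd_count_lines() {
    awk '
        NR != 1 && previous != $0 && count % 2 == 1 { print previous; count = 0 }
        { previous = $0; count++ }
        END { if (count % 2 == 1) print previous }
    ' "$@"
}

# Made-up sorted input: "b" twice and "c" four times, the rest once each.
printf '%s\n' a b b c c c c d e | odd_count_lines
```

With this input only a, d and e are printed, since b and c each occur an even number of times.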