Delete duplicate lines pairwise?

I worked out the sed answer not long after I posted this question; no one else has used sed so far so here it is:

sed '$!N;/^\(.*\)\n\1$/d;P;D'
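To see what the one-liner does, here is a minimal sketch that wraps it in a function and runs it on a small sample. The function name dedupe_pairs and the sample lines are made up for illustration. N appends the next line to the pattern space, the regex deletes the pair if both lines are equal, and otherwise P prints the first line and D drops it so the second line can be compared with the one after it.

```shell
# Hypothetical wrapper around the one-liner above; the name is illustrative.
dedupe_pairs() {
    # $!N  : unless on the last line, append the next line to the pattern space
    # /../d: if the two lines are identical, delete both and start over
    # P;D  : otherwise print the first line, drop it, and re-run on the rest
    sed '$!N;/^\(.*\)\n\1$/d;P;D' "$@"
}

# Three a's leave one a behind; the pair of c's disappears entirely.
printf '%s\n' a a a b c c d | dedupe_pairs
# prints:
# a
# b
# d
```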

A little playing around with the more general problem (what about deleting lines in sets of three? Or four, or five?) provided the following extensible solution:

sed -e ':top' -e '$!{/\n/!{N;b top' -e '};};/^\(.*\)\n\1$/d;P;D' temp

Extended to remove triples of lines:

sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' temp

Or to remove quads of lines:

sed -e ':top' -e '$!{/\n.*\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1\n\1$/d;P;D' temp
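The pattern in these commands is regular enough to generate. The following is a rough sketch, not part of the original answer, of a shell function that builds the same sed script for an arbitrary group size. The name dedupe_groups and its argument order are assumptions, and it expects a group size of at least 2.

```shell
# Hypothetical helper: dedupe_groups N [file...]
# Builds the same sed script as the commands above for groups of N lines (N >= 2).
dedupe_groups() {
    n=$1
    shift
    need='\n'          # loop test: pattern space must contain N-1 newlines
    match='^\(.*\)'    # match test: first line, then N-1 repeats of it
    i=1
    while [ "$i" -lt "$n" ]; do
        match="$match"'\n\1'
        if [ "$i" -gt 1 ]; then
            need="$need"'.*\n'
        fi
        i=$((i + 1))
    done
    sed -e ':top' -e '$!{/'"$need"'/!{N;b top' -e '};};/'"$match"'$/d;P;D' "$@"
}

# Groups of three: the a's vanish, the two b's and the single c remain.
printf '%s\n' a a a b b c | dedupe_groups 3
# prints:
# b
# b
# c
```

With a group size of 2 this produces exactly the two-line script shown earlier.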

sed has an additional advantage over most other options: it truly operates on a stream, holding no more lines in memory than the number of lines being checked for duplicates.

As cuonglm pointed out in the comments, setting the locale to C is necessary to avoid failures to properly remove lines containing multi-byte characters. So the commands above become:

LC_ALL=C sed '$!N;/^\(.*\)\n\1$/d;P;D' temp
LC_ALL=C sed -e ':top' -e '$!{/\n/!{N;b top' -e '};};/^\(.*\)\n\1$/d;P;D' temp
LC_ALL=C sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' temp
# Etc.

It's not very elegant, but it's as simple as I can come up with:

uniq -c input | awk '{if ($1 % 2 == 1) { print substr($0, 9) }}'

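The substr() just trims off the count that uniq -c prepends. With GNU uniq the count is right-aligned in a 7-character field followed by a space, so the line itself starts at column 9. That'll work until you have more than 9,999,999 duplicates of a line, at which point the count overflows its 7-character field and the fixed offset is no longer right.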
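If the fixed offset is a concern, the count can be stripped with a pattern instead of a column number. A minimal sketch of that variation follows; the function name odd_counts and the sample input are assumptions. It does not depend on how wide the count field is.

```shell
# Hypothetical variant: remove the count with a regex rather than substr().
odd_counts() {
    uniq -c "$@" | awk '$1 % 2 == 1 { sub(/^ *[0-9]+ /, ""); print }'
}

# a appears 3 times (odd), b once, c twice (even), d once.
printf '%s\n' a a a b c c d | odd_counts
# prints:
# a
# b
# d
```

Only the leading spaces, the digits, and the single separating space are removed, so any leading whitespace in the line itself is kept.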

Try the awk script below:

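#!/usr/bin/awk -f
{
  # A new value starts here. If the lines seen so far add up to an odd
  # count, one copy of the previous value is left over: print it.
  if ((NR!=1) && (previous!=$0) && (count%2==1)) {
    print previous;
    count=0;
  }
  # An even count is not reset; it stays even, so the parity test still works.
  previous=$0;
  count++;
}
END {
  # Handle the final run of lines.
  if (count%2==1) {
    print previous;
  }
}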

This assumes that the lines.txt file is sorted, so that duplicate lines are adjacent.

The test:

$ chmod +x script.awk
$ ./script.awk lines.txt
a
d
e
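The contents of lines.txt are not shown here. As an illustration only, the sketch below runs the same logic on a made-up sorted input that gives this output; the sample lines and the function name odd_runs are assumptions, not the original test data.

```shell
# Same logic as script.awk, written inline; odd_runs is an illustrative name.
odd_runs() {
    awk '
        NR != 1 && previous != $0 && count % 2 == 1 { print previous; count = 0 }
        { previous = $0; count++ }
        END { if (count % 2 == 1) print previous }
    ' "$@"
}

# Hypothetical input: a x3, b x2, c x4, d x3, e x1.
printf '%s\n' a a a b b c c c c d d d e | odd_runs
# prints:
# a
# d
# e
```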
