Remove partially duplicated consecutive lines but keep the first and last

uniq is (sort of) the perfect tool for this. By default, uniq keeps the first line of each set of duplicates, but not the last.

uniq has a -f flag which allows you to skip the first few fields.

From man uniq:

   -f, --skip-fields=N
          avoid comparing the first N fields

   -s, --skip-chars=N
          avoid comparing the first N characters

   A field is a run of blanks (usually spaces and/or TABs), then non-blank characters.  Fields are skipped before chars.

Example with uniq -c, which prefixes each line with a count so you can see what uniq is doing:

-bash-4.2$ uniq -c -f 1 original_file
  1 1447790360      99999   99999   20.25   20.25   20.25   20.50
  9 1447790362      20.25   20.25   20.25   20.25   20.25   20.50
  1 1447790388      20.25   20.25   99999   99999   99999   99999
  1 1447790389      99999   99999   20.25   20.25   20.25   20.50
  1 1447790391      20.00   20.25   20.25   20.25   20.25   20.50
  3 1447790394      20.25   20.25   20.25   20.25   20.25   20.50

Not bad. Pretty close to what is wanted, and easy to do, but the last matching line of each group is missing.

The grouping options in uniq are also interesting for this question . . .

   --group[=METHOD]
          show all items, separating groups with an empty line; METHOD={separate(default),prepend,append,both}

   -D, --all-repeated[=METHOD]
          print all duplicate lines; groups can be delimited with an empty line; METHOD={none(default),prepend,separate}

Example with --group=both, which puts an empty line before and after every group:

-bash-4.2$ uniq --group=both -f 1 original_file

1447790360      99999   99999   20.25   20.25   20.25   20.50

1447790362      20.25   20.25   20.25   20.25   20.25   20.50
1447790365      20.25   20.25   20.25   20.25   20.25   20.50
1447790368      20.25   20.25   20.25   20.25   20.25   20.50
1447790371      20.25   20.25   20.25   20.25   20.25   20.50
1447790374      20.25   20.25   20.25   20.25   20.25   20.50
1447790377      20.25   20.25   20.25   20.25   20.25   20.50
1447790380      20.25   20.25   20.25   20.25   20.25   20.50
1447790383      20.25   20.25   20.25   20.25   20.25   20.50
1447790386      20.25   20.25   20.25   20.25   20.25   20.50

1447790388      20.25   20.25   99999   99999   99999   99999

1447790389      99999   99999   20.25   20.25   20.25   20.50

1447790391      20.00   20.25   20.25   20.25   20.25   20.50

1447790394      20.25   20.25   20.25   20.25   20.25   20.50
1447790397      20.25   20.25   20.25   20.25   20.25   20.50
1447790400      20.25   20.25   20.25   20.25   20.25   20.50

Then grep for the line before and after every empty line, and strip the empty lines along with the -- separators that grep prints between non-adjacent context blocks:

-bash-4.2$ uniq --group=both -f 1 original_file |grep -B1 -A1 ^$ |grep -Ev "^$|^--$"
1447790360      99999   99999   20.25   20.25   20.25   20.50
1447790362      20.25   20.25   20.25   20.25   20.25   20.50
1447790386      20.25   20.25   20.25   20.25   20.25   20.50
1447790388      20.25   20.25   99999   99999   99999   99999
1447790389      99999   99999   20.25   20.25   20.25   20.50
1447790391      20.00   20.25   20.25   20.25   20.25   20.50
1447790394      20.25   20.25   20.25   20.25   20.25   20.50
1447790400      20.25   20.25   20.25   20.25   20.25   20.50

Tah dahhh! Pretty good.
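
If you want to try the pipeline without the original file, here is a minimal sketch that wraps it in a function and feeds it a few made-up lines. The function name first_last and the sample data are my own, not from the question. It assumes GNU uniq with --group support and GNU grep, and an input that contains no empty lines and no lines consisting only of --.

```shell
#!/usr/bin/env bash
# Keep the first and last line of every run of consecutive lines that
# are equal once the first field is skipped.
# Assumptions: GNU uniq (--group), GNU grep, no empty lines in the input.
first_last() {
    # "${1:--}" falls back to "-" (standard input) when no file is given.
    uniq --group=both -f 1 "${1:--}" |
        grep -B1 -A1 '^$' |   # keep lines adjacent to a group boundary
        grep -Ev '^$|^--$'    # drop the boundaries and the "--" separators
}

# Hypothetical sample: a single line, a run of three, then a run of two.
first_last <<'EOF'
100 a b
101 x y
102 x y
103 x y
104 a b
105 a b
EOF
```

On this sample, only the 102 line is dropped, since it is neither the first nor the last line of its run.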

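With an awk one-liner:

awk '{n=$2$3$4$5$6$7}l1!=n{if(p)print l0; print; p=0}l1==n{p=1}{l0=$0; l1=n}END{if(p)print l0}' file

The whole point is to manipulate a few variables: n stores all fields except the first for the current line, l1 holds the same for the previous line, and l0 holds the whole previous line. p is a flag marking that the previous line repeated its predecessor and has not been printed yet. The END block flushes the final line only when it is still pending, so a file that ends with a non-repeated line does not get that line twice.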
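
The one-liner hard-codes fields 2 to 7 and joins them without a separator, so it only fits this exact layout. Here is a sketch of the same idea that compares everything after the first field instead, whatever the number of columns. The function name first_last_awk and the longer variable names are my own, and the sample input is made up.

```shell
#!/usr/bin/env bash
# Same logic as the one-liner, spelled out. The comparison key is the
# whole line with its first field removed.
first_last_awk() {
    awk '
    {
        key = $0
        # Strip leading blanks, the first field, and the blanks after it.
        sub(/^[ \t]*[^ \t]+[ \t]*/, "", key)
    }
    # A new run starts: flush the pending last line of the old run,
    # then print the first line of the new one.
    NR == 1 || key != prev_key {
        if (pending) print prev_line
        print
        pending = 0
    }
    # Same key as the previous line: remember it, but do not print yet.
    NR > 1 && key == prev_key { pending = 1 }
    { prev_line = $0; prev_key = key }
    # Flush the last line of the final run if it was never printed.
    END { if (pending) print prev_line }
    ' "$@"
}

# Hypothetical sample ending with a line that has no duplicate.
first_last_awk <<'EOF'
100 a b
101 x y
102 x y
103 x y
104 a b
105 a b
106 z z
EOF
```

Again only the 102 line should be dropped, and the 106 line is printed once. Unlike the uniq pipeline, this version does not care whether the input contains empty lines.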
