How to remove duplicate values based on multiple columns
Remove lines in which columns 3, 4 and 5 all hold the same value:
awk '!($3==$4&&$4==$5)' data_file
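As a quick check of the command above, here is a minimal sketch that wraps it in a shell function and runs it on made-up data. The function name `drop_uniform_345` and the sample rows are assumptions for illustration only; they are not part of the original answer.

```shell
# Hypothetical wrapper around the one-liner; reads files given as
# arguments, or standard input when there are none.
drop_uniform_345() {
  awk '!($3==$4 && $4==$5)' "$@"
}

# Made-up sample: rows "a" and "c" have identical columns 3, 4 and 5.
drop_uniform_345 <<'EOF'
a 1 x x x
b 2 x y x
c 3 7 7 7
d 4 x x y
EOF
```

With this sample, only the `b` and `d` rows should be printed. Keep in mind that awk compares fields numerically when both look like numbers, so `1` and `1.0` count as equal here.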
Remove lines whose columns 3, 4 and 5 match those of an earlier line (the first occurrence is kept):
awk '!seen[$3,$4,$5]++' data_file
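The `seen` idiom can be tried on a small input. This is only a sketch: the function name `dedupe_by_345` and the sample rows are hypothetical.

```shell
# Hypothetical wrapper: seen[$3,$4,$5] counts how often each
# (column 3, column 4, column 5) combination has appeared so far.
# The post-increment returns the old count, so the pattern is true
# only the first time a combination is met.
dedupe_by_345() {
  awk '!seen[$3,$4,$5]++' "$@"
}

# Made-up sample: rows "b" and "d" repeat the key of row "a".
dedupe_by_345 <<'EOF'
a 1 x y z
b 2 x y z
c 3 x y w
d 4 x y z
EOF
```

With this sample, the `a` and `c` rows should be printed. Columns 1 and 2 play no part in the comparison.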
Update for n columns:
Remove lines in which columns 3, 4, ..., n all hold the same value:
awk 'v=0;{for(i=4;i<=NF;i++) {if($i!=$3) {v=1; break;}}} v' data_file
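v=0 : reset v to 0 for every record. Used as a pattern, the assignment evaluates to 0 (false), so it prints nothing by itself.
for(i=4;i<=NF;i++) {if($i!=$3) {v=1; break;}} : loop from the 4th column to the last one; set v to 1 and break as soon as a column differs from the 3rd.
v : print the line if v is not 0.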
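The same logic can be written as a single action block, which some readers may find easier to follow. This is a sketch rather than the answer's own code: the function name `drop_uniform_tail` and the variable name `differs` are made up for illustration.

```shell
# Hypothetical, more verbose equivalent of the one-liner above.
drop_uniform_tail() {
  awk '{
    differs = 0                      # reset the flag for every record
    for (i = 4; i <= NF; i++)        # compare columns 4..NF with column 3
      if ($i != $3) { differs = 1; break }
    if (differs) print               # keep the line only if some column differs
  }' "$@"
}

# Made-up sample: row "a" is uniform from column 3 on, row "c" has only 4 fields.
drop_uniform_tail <<'EOF'
a 1 x x x x
b 2 x x y x
c 3 q q
EOF
```

With this sample, only the `b` row should be printed. As in the one-liner, a line with three or fewer fields never enters the loop and is therefore dropped.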
Remove lines whose columns 3, 4, ..., n match those of an earlier line:
awk '(l=$0) && ($1=$2=""); !seen[$0]++ {print l}' data_file
(l=$0) && ($1=$2="") : back up the original line in l, then empty the 1st and 2nd columns, which makes awk rebuild $0. This expression always evaluates to false, so it won't print anything. Note that && takes precedence over =, which is why each assignment needs its own ().
!seen[$0]++ {print l} : the usual seen trick; print the original line if the rebuilt $0 has not been seen before.
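If modifying $0 feels too subtle, an alternative is to build the lookup key explicitly from columns 3 to NF. This is a different technique from the one above, shown only as a sketch; the function name `dedupe_by_tail` and the variable name `key` are hypothetical. For input where every line has at least three fields and the default field separator is used, it should select the same lines.

```shell
# Hypothetical alternative: join columns 3..NF with SUBSEP and use
# the result as the key, leaving $0 untouched.
dedupe_by_tail() {
  awk '{
    key = ""
    for (i = 3; i <= NF; i++)
      key = key SUBSEP $i            # SUBSEP keeps "a b" and "ab" keys apart
    if (!seen[key]++) print          # print the first line seen for each key
  }' "$@"
}

# Made-up sample: row "b" repeats columns 3..n of row "a".
dedupe_by_tail <<'EOF'
a 1 x y z
b 2 x y z
c 3 x y
d 4 x y z w
EOF
```

With this sample, the `a`, `c` and `d` rows should be printed. Because $0 is never rewritten, the printed lines keep their original spacing without needing a backup copy.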