remove duplicates based on the value of another column
Simple version:
sort input.txt | awk -F, '!a[$1]++'
"false" sorts alphabetically before "true," and the Awk command here just keeps the first row only for each distinct first field value.
If you want to keep "true" instead of "false," reverse sort it, pass it to the same Awk command, and reverse sort it again afterward.
awk -F, '$2 == "false" {data[$1]=$2 } $2=="true" { if ( data[$1]!="false" ) { data[$1]=$2 } } END { OFS=","; for (item in data) { print item,data[item] }}' input
To expand the script vertically for explanation:
BEGIN {
FS="," # Set the input separator; this is what -F, does.
}
$2 == "false" { # For any line whose second field is "false", we
data[$1]=$2 # will use that value no matter what.
}
$2=="true" { # For lines whose second field is "true",
if ( data[$1]!="false" ) { # only keep if if we haven't yet seen a
data[$1]=$2 # "false"
}
}
END { # Now that we have tabulated our data, we
OFS="," # can print it out by iterating through
for (item in data) { # the array we created.
print item,data[item]
}
}
perl -F, -lane '
exists $h{$F[0]} or $h[$h{$F[0]}=@h]=$_;
$h=$_; /,false$/ or $_=$h for $h[$h{$F[0]}];
END{ print for @h; }
' duplicates.file
Data structures:
- Hash
%h
whose keys are first fields (AAA, BBB, CCC, etc.) and corresponding values are numbers telling the order in which the keys were encountered. Thus, e.g., key AAA => 0, key BBB => 1, key CCC => 2. - Array
@h
whose elements are lines contained in the order of printing. So if both true and false are found in data, then the false value will go into the array. OTW, if there's one type of data, then that would be present.
Another way is using GNU sed:
sed -Ee '
G
/^([^,]*),(false|true)\n(.*\n)?\1,\2(\n|$)/ba
/^([^,]*)(,true)\n(.*\n)?\1,false(\n|$)/ba
/^([^,]*)(,false)\n((.*\n)?)\1,true(\n|$)/{
s//\3\1\2\5/;h;ba
}
s/([^\n]*)\n(.*)$/\2\n\1/;s/^\n*//
h;:a;$!d;g
' duplicates.file
FWIW, the POSIX equivalent code for the above GNU-sed code is listed below:
sed -e '
G
/^\([^,]*\),\(false\)\n\(.*\n\)\{0,1\}\1,\2$/ba
/^\([^,]*\),\(false\)\n\(.*\n\)\{0,1\}\1,\2\n/ba
/^\([^,]*\),\(true\)\n\(.*\n\)\{0,1\}\1,\2$/ba
/^\([^,]*\),\(true\)\n\(.*\n\)\{0,1\}\1,\2\n/ba
/^\([^,]*\),true\n\(.*\n\)\{0,1\}\1,false$/ba
/^\([^,]*\),true\n\(.*\n\)\{0,1\}\1,false\n/ba
/^\([^,]*\)\(,false\)\n\(\(.*\n\)\{0,1\}\)\1,true$/{
s//\3\1\2/
h
ba
}
/^\([^,]*\)\(,false\)\n\(\(.*\n\)\{0,1\}\)\1,true\n/{
s//\3\1\2\n/
h
ba
}
y/\n_/_\n/
s/\([^_]*\)_\(.*\)$/\2_\1/;s/^_*//
y/\n_/_\n/
h;:a;$!d;g
' duplicates.file
Explanation
- In this method we store the result to be finally printed in the hold space.
- For every line read, we append the hold space to the pattern space for examination of the current line vis-a-vis the existing state of the hold space.
- Now 5 things can possibly happen during this comparison:
- a) Current line matches somewhere in the hold line & false:false.
- [ACTION] Since same false state is found, then do nothing.
- b) Current line matches somewhere in the hold line & true:true.
- [ACTION] Since same true state is found, then do nothing.
- c) Current line matches somewhere in the hold line & true:false.
- [ACTION] Since a false state already exists, do nothing.
- d) Current line matches somewhere in the hold line & false:true.
- [ACTION] This involves some work, in that we need to replace the false line at the exact same position where the true is located.
- e) Current line DOES NOT match anywhere in the hold line.
- [ACTION] Move the present line to the end.
- a) Current line matches somewhere in the hold line & false:false.
Results
AA,false
BB,false
CC,false
DD,true