Count the number of rows with a string occurring n times in multiple columns
Yes, you can do this in awk:
awk '{
k=0;
for(i=2;i<=NF;i++){
if($i == 0){
k++
}
}
if(k==3){
tot++
}
}
END{
print tot+0
}' file
And also with (GNU) sed and wc:
$ sed -nE '/\b0\b.*\b0\b.*\b0\b/p' file | wc -l
7
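One difference worth checking on your own data: the sed pattern matches any row with three or more 0 fields, while the awk loop counts rows with exactly three. Both give 7 on the input here, but a minimal sketch on made-up rows (the `rows` data below is hypothetical, and GNU sed is assumed for `\b`) shows where they diverge:

```shell
# Hypothetical tab-separated rows: an id, then values. Row 2 has four zeros.
rows=$(printf '1\t0\t0\t0\t5\n2\t0\t0\t0\t0\n3\t0\t1\t2\t0')

# sed pattern from the answer: matches rows with three OR MORE 0 fields.
at_least=$(printf '%s\n' "$rows" | sed -nE '/\b0\b.*\b0\b.*\b0\b/p' | wc -l)

# awk loop from the answer, condensed: exactly three 0 fields after column 1.
exactly=$(printf '%s\n' "$rows" |
    awk '{ k = 0; for (i = 2; i <= NF; i++) if ($i == 0) k++; if (k == 3) tot++ }
         END { print tot + 0 }')

echo "at least three: $at_least, exactly three: $exactly"
```

Here rows 1 and 2 match the sed pattern, but only row 1 has exactly three zeros.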
But, personally, I would do it in perl instead:
$ perl -ale '$tot++ if (grep{$_ == 0 } @F) == 3 }{ print $tot' file
7
Or, the slightly less condensed:
$ perl -ale 'if( (grep{$_ == 0 } @F) == 3 ){
$tot++
}
END{
print $tot
}' file
7
And the same thing, for the golfers among you:
$ perl -ale '(grep{$_==0}@F)==3&&$t++}{print$t' file
7
Explanation
-ale: -a makes perl behave like awk. It will read each line of the input file and split it on whitespace into the array @F (since perl 5.20, -a implies -n, which supplies the line-by-line loop). The -l adds a \n to each call of print and removes trailing newlines from the input, and the -e is the script that should be applied to each line of input.
$tot++ if (grep{$_ == 0 } @F) == 3: increment $tot by one for every line that has exactly 3 fields that are 0. Since the 1st field starts from 1, we know it will never be 0, so we don't need to exclude it.
}{: this is just a shorthand way of writing END{}, that is, of giving a block of code that will be executed after the file has been processed. So, }{ print $tot will print the total number of lines with exactly three fields with a value of 0.
With GNU grep or ripgrep:
$ LC_ALL=C grep -c $'\t''0\b.*\b0\b.*\b0\b' ip.txt
7
$ rg -c '\t0\b.*\b0\b.*\b0\b' ip.txt
7
where $'\t' matches a tab character, so the count stays correct even if the first column is 0.
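A small sketch of why the leading tab matters, assuming GNU grep for `\b`. The `rows` data is made up for illustration, and `tab` is produced with printf so the example also runs in shells that lack $'...' quoting:

```shell
tab=$(printf '\t')   # portable stand-in for $'\t'

# Hypothetical rows: the first has id 0 but only two zero values.
rows=$(printf '0\t0\t0\t5\n7\t0\t0\t0')

# Without the tab anchor, the id 0 is counted as one of the three zeros.
unanchored=$(printf '%s\n' "$rows" | LC_ALL=C grep -c '\b0\b.*\b0\b.*\b0\b')

# With the anchor, the first matched 0 must follow a tab, so all three
# zeros come from column 2 onward.
anchored=$(printf '%s\n' "$rows" | LC_ALL=C grep -c "${tab}"'0\b.*\b0\b.*\b0\b')

echo "unanchored=$unanchored anchored=$anchored"
```

The unanchored pattern counts both rows, while the anchored one counts only the second.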
Sample run with large file:
$ perl -0777 -ne 'print $_ x 1000000' ip.txt > f1
$ du -h f1
92M f1
$ time LC_ALL=C grep -c $'\t''0\b.*\b0\b.*\b0\b' f1 > f2
real 0m0.416s
$ time rg -c '\t0\b.*\b0\b.*\b0\b' f1 > f3
real 0m1.271s
$ time LC_ALL=C awk 'gsub(/\t0/,"")==3{c++} END{print c+0}' f1 > f4
real 0m8.645s
$ time perl -ale '$tot++ if (grep{$_ == 0 } @F) == 3 }{ print $tot' f1 > f5
real 0m14.349s
$ time LC_ALL=C sed -n 's/\t0\>//4;t;s//&/3p' f1 | wc -l > f6
real 0m14.075s
$ time LC_ALL=C sed -n 's/\t0\>/&/3p' f1 | wc -l > f8
real 0m6.772s
$ time LC_ALL=C awk '{
k=0;
for(i=2;i<=NF;i++){
if($i == 0){
k++
}
}
if(k==3){
tot++
}
}
END{
print tot
}' f1 > f7
real 0m10.675s
Remove LC_ALL=C if the file can contain non-ASCII characters. ripgrep is usually faster than GNU grep, but in this test run GNU grep was faster. As per ripgrep's author, (?-u:\b) can be used to avoid the Unicode word boundary, but that resulted in a similar time for the above case.
$ awk 'gsub(/\t0/,"")==3{c++} END{print c+0}' file
7