Fastest way to sum Nth column in text file
GNU datamash
$ datamash -t, count 3 sum 3 < file
3,604720
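If the input begins with a header line, datamash can be told to skip it with the standard --header-in flag (a small variation on the command above, not part of the original answer):
$ datamash -t, --header-in count 3 sum 3 < file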
Some testing
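(The timings below are on a 3-million-line file. One plausible way to build such a file from the 3-line sample, consistent with the sums printed below, is to repeat each line a million times; the name longfile matches the tests:)
$ awk '{ for (i = 0; i < 1000000; i++) print }' file > longfile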
$ time gawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
604720000000 3000000
real 0m2.851s
user 0m2.784s
sys 0m0.068s
$ time mawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
6.0472e+11 3000000
real 0m0.967s
user 0m0.920s
sys 0m0.048s
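Note that mawk prints the large sum in scientific notation because awk's default output format (OFMT) is %.6g; printing with a fixed-point printf format recovers all the digits:
$ mawk -F',' '{ sum += $3 } END{ printf "%.0f %d\n", sum, NR }' longfile
604720000000 3000000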
$ time perl -F, -nle '$sum += $F[2] }{ print "$.,$sum"' longfile
3000000,604720000000
real 0m3.394s
user 0m3.364s
sys 0m0.036s
$ time { cut -d, -f3 <longfile |paste -s -d+ - |bc ; }
604720000000
real 0m1.679s
user 0m1.416s
sys 0m0.248s
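(In that pipeline, cut extracts the third field, paste -s -d+ joins all the values into one long a+b+c expression, and bc evaluates it. With a toy third column of 1, 2, 3 the intermediate stream looks like this:)
$ printf '%s\n' x,y,1 x,y,2 x,y,3 | cut -d, -f3 | paste -s -d+ -
1+2+3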
$ time datamash -t, count 3 sum 3 < longfile
3000000,604720000000
real 0m0.815s
user 0m0.716s
sys 0m0.036s
So mawk and datamash appear to be the pick of the bunch.
Awk is a fast and efficient tool for processing text files.
awk -F',' '{ sum += $3 }
END{ printf "Sum of 3rd field: %d. Total number of lines: %d\n", sum, NR }' file
Sample output:
Sum of 3rd field: 604720. Total number of lines: 3
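Since the question asks about the Nth column, the field number can be passed in as an awk variable rather than hard-coded (col is just an illustrative name):
$ awk -F',' -v col=3 '{ sum += $col }
  END{ printf "Sum of field %d: %d. Total number of lines: %d\n", col, sum, NR }' file
Sum of field 3: 604720. Total number of lines: 3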
Conceptual note:
I must note that all those non-awk alternatives run faster only on such "ideal" numeric columns. As soon as the format gets even slightly more complex (for example, with some additional information to be stripped before the calculation, as in
<1064458324:a,<38009543:b,<201507:c,<9:d,<0:e,<1:f,<1:g,1298
), all those speed advantages go away (not to mention that some of those tools won't be able to perform the needed processing at all).
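As a sketch of that point, awk can strip such decoration in-line before summing, with no extra pipeline stages (this assumes the only digits in the field are the value itself, as in the sample line above):
$ awk -F',' '{ gsub(/[^0-9]/, "", $3); sum += $3 } END{ print sum }' file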