Fastest way to sum Nth column in text file
GNU datamash
$ datamash -t, count 3 sum 3 < file
3,604720
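If the input begins with a header line, datamash can be told to skip it with the standard --header-in flag (a small variation on the command above, not part of the original answer):
$ datamash -t, --header-in count 3 sum 3 < file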
Some testing
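(The timings below are on a 3-million-line file. One plausible way to build such a file from the 3-line sample, consistent with the sums printed below, is to repeat each line a million times; the name longfile matches the tests:)
$ awk '{ for (i = 0; i < 1000000; i++) print }' file > longfile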
$ time gawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
604720000000 3000000
real 0m2.851s
user 0m2.784s
sys 0m0.068s
$ time mawk -F',' '{ sum += $3 } END{ print sum, NR }' longfile
6.0472e+11 3000000
real 0m0.967s
user 0m0.920s
sys 0m0.048s
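Note that mawk prints the large sum in scientific notation because awk's default output format (OFMT) is %.6g; printing with a fixed-point printf format recovers all the digits:
$ mawk -F',' '{ sum += $3 } END{ printf "%.0f %d\n", sum, NR }' longfile
604720000000 3000000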
$ time perl -F, -nle '$sum += $F[2] }{ print "$.,$sum"' longfile
3000000,604720000000
real 0m3.394s
user 0m3.364s
sys 0m0.036s
$ time { cut -d, -f3 <longfile |paste -s -d+ - |bc ; }
604720000000
real 0m1.679s
user 0m1.416s
sys 0m0.248s
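(In that pipeline, cut extracts the third field, paste -s -d+ joins all the values into one long a+b+c expression, and bc evaluates it. With a toy third column of 1, 2, 3 the intermediate stream looks like this:)
$ printf '%s\n' x,y,1 x,y,2 x,y,3 | cut -d, -f3 | paste -s -d+ -
1+2+3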
$ time datamash -t, count 3 sum 3 < longfile
3000000,604720000000
real 0m0.815s
user 0m0.716s
sys 0m0.036s
So mawk and datamash appear to be the pick of the bunch.
Awk is a fast and efficient tool for processing text files.
awk -F',' '{ sum += $3 }
END{ printf "Sum of 3rd field: %d. Total number of lines: %d\n", sum, NR }' file
Sample output:
Sum of 3rd field: 604720. Total number of lines: 3
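Since the question asks about the Nth column, the field number can be passed in as an awk variable rather than hard-coded (col is just an illustrative name):
$ awk -F',' -v col=3 '{ sum += $col }
  END{ printf "Sum of field %d: %d. Total number of lines: %d\n", col, sum, NR }' file
Sum of field 3: 604720. Total number of lines: 3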
Conceptual note:
I must note that all those non-awk alternatives run faster only on such "ideal" numeric columns. As soon as the format gets even slightly more complex (for example, with some additional information to be stripped before the calculation, as in
<1064458324:a,<38009543:b,<201507:c,<9:d,<0:e,<1:f,<1:g,1298
), all those speed advantages go away (not to mention that some of those tools won't be able to perform the needed processing at all).
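As a sketch of that point, awk can strip such decoration in-line before summing, with no extra pipeline stages (this assumes the only digits in the field are the value itself, as in the sample line above):
$ awk -F',' '{ gsub(/[^0-9]/, "", $3); sum += $3 } END{ print sum }' file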