What's an efficient way to calculate covariance for a large data set?

Check out "How to calculate correlation accurately". There are two common formulas that are algebraically equivalent, but one has much better numerical properties than the other.
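To make the difference concrete, here is a minimal Python sketch. It assumes the two formulas in question are the textbook sum-of-products form and the mean-centered two-pass form; the function names are made up for this example.

```python
def cov_sum_of_products(xs, ys):
    """Sample covariance as (sum(x*y) - sum(x)*sum(y)/n) / (n - 1).

    One pass over the data, but it subtracts two large, nearly equal
    numbers, so it is prone to catastrophic cancellation.
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (sxy - sx * sy / n) / (n - 1)


def cov_two_pass(xs, ys):
    """Sample covariance from deviations about the means.

    First pass computes the means, second pass sums the products of
    the deviations, which stay small when the data has a large offset.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
```

On well-scaled data the two agree. On data whose offset is large relative to its spread (say values near 1e9 that differ by a few units), the first form typically loses most of its significant digits while the second does not.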


The single-pass and parallel versions at Wikipedia may be what you're looking for. The single-pass version is more numerically stable, but it moves a division into the inner loop, which may hurt performance.
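For reference, a minimal Python sketch of that kind of single-pass update, written from the commonly cited online recurrence (the class and method names are my own, not from any library):

```python
class OnlineCovariance:
    """Running covariance of paired samples, updated one pair at a time."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.comoment = 0.0  # running sum of (x - mean_x) * (y - mean_y)

    def add(self, x, y):
        self.n += 1
        dx = x - self.mean_x                       # deviation from the OLD mean of x
        self.mean_x += dx / self.n                 # the division inside the loop
        self.mean_y += (y - self.mean_y) / self.n
        self.comoment += dx * (y - self.mean_y)    # uses the NEW mean of y

    def sample_covariance(self):
        if self.n < 2:
            raise ValueError("need at least two pairs")
        return self.comoment / (self.n - 1)
```

Usage is just a loop: call `add(x, y)` for each pair as it streams in, then read `sample_covariance()` at the end. Nothing needs to be held in memory beyond the four running values.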


A stable single-pass algorithm has been published since this question was originally answered:

Bennett, Janine, et al. "Numerically stable, single-pass, parallel statistics algorithms." Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on. IEEE, 2009.

An implementation is given in Boost.
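As a rough illustration of how parallel schemes of this kind work, here is a Python sketch of the pairwise combination rule for covariance: each worker summarizes its own chunk, and the summaries are then merged. This is my own sketch, not code from the paper or from Boost, and `CovStats`, `summarize` and `merge` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CovStats:
    n: int            # number of pairs summarized
    mean_x: float
    mean_y: float
    comoment: float   # sum of (x - mean_x) * (y - mean_y)


def summarize(xs, ys):
    """Summarize one chunk with a two-pass computation."""
    n = len(xs)
    if n == 0:
        return CovStats(0, 0.0, 0.0, 0.0)
    mx = sum(xs) / n
    my = sum(ys) / n
    c = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return CovStats(n, mx, my, c)


def merge(a, b):
    """Combine the summaries of two disjoint chunks."""
    n = a.n + b.n
    if n == 0:
        return CovStats(0, 0.0, 0.0, 0.0)
    dx = b.mean_x - a.mean_x
    dy = b.mean_y - a.mean_y
    return CovStats(
        n,
        a.mean_x + dx * b.n / n,
        a.mean_y + dy * b.n / n,
        # cross term accounts for the difference between the chunk means
        a.comoment + b.comoment + dx * dy * a.n * b.n / n,
    )
```

Because `merge` only needs the two summaries, chunks can be processed independently and combined in any order; the sample covariance of the whole data set is `comoment / (n - 1)` of the final result.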