How to speed up large double sums in a table?

Using only matrix operations makes this much faster, although the runtime still scales quadratically with npart:

matrixM = With[{diff = Flatten[Outer[Subtract, vecs, vecs, 1], 1]},
  (Transpose[diff].diff)/npart];

For npart=10000 this will take a lot of memory, though: the intermediate array diff has npart^2 = 10^8 rows of 3 machine reals.
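
As a rough back-of-the-envelope check (a side calculation, not part of the original answer), the size of diff can be estimated before building it:

With[{npart = 10000}, npart^2*3*8/2.^30] (* npart^2 rows × 3 coordinates × 8 bytes *)
(* ≈ 2.2, i.e. a bit over 2 GiB just for diff *)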


Update

After the discussions on the other solutions, it occurred to me that what's being calculated here is simply the covariance matrix of the vectors in vecs:

matrixM = 2*(npart-1)*Covariance[vecs]

Much, much faster, with no memory issues and no stability issues!
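
A quick, self-contained sanity check of this identity on a small random set (the names np and vv are just local to this test):

Block[{np = 300, vv, direct},
 SeedRandom[1];
 vv = RandomReal[{-1, 1}, {np, 3}];
 (* explicit double sum, entry by entry *)
 direct = Table[
    Sum[(vv[[i, m]] - vv[[j, m]]) (vv[[i, n]] - vv[[j, n]]),
     {i, np}, {j, np}]/np, {m, 3}, {n, 3}];
 Max[Abs[direct - 2 (np - 1) Covariance[vv]]]]
(* should be zero to machine precision *)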


Your double sum can be written in terms of single sums: expanding the product $(r_m^i - r_m^j)(r_n^i - r_n^j)$ and summing over both $i$ and $j$ gives

$$M_{mn}=2\sum_{i=1}^N r_m^i r_n^i-\frac{2}{N}\sum_{i=1}^N r_m^i \sum_{j=1}^N r_n^j$$

A theoretically equivalent but more numerically stable version of the above (suggested by @Roman) is given by

$$M_{mn}=2\sum_{i=1}^N (r_m^i - \bar{r}_m)(r_n^i-\bar{r}_n)$$

where $\bar{r}_k=\frac{1}{N}\sum_{i=1}^N r_k^i$ is the mean of the $r_k^i$ values over $i$.

nparticles = 1000;
SeedRandom[12345];
r = RandomVariate[UniformDistribution[], {nparticles, 3}];

m = 1; n = 2;
AbsoluteTiming[
 Sum[(r[[i, m]] - r[[j, m]]) (r[[i, n]] - r[[j, n]])/nparticles, 
   {i, nparticles}, {j, nparticles}]]
(* {3.4060337777777776`,-2.42222014120762`} *)

AbsoluteTiming[
 2 Total[(r[[All, m]] - Total[r[[All, m]]]/nparticles) (r[[All, n]] - 
      Total[r[[All, n]]]/nparticles)]]
(* {0.00008335802469135802`,-2.4222201412076174`} *)

Using single sums is about 40,000 times faster for 1,000 particles.
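
The same trick gives the full 3×3 matrix in one go; a short sketch (equivalent to 2 (nparticles - 1) Covariance[r] from the other answer):

centered = r - ConstantArray[Mean[r], nparticles]; (* subtract the column means *)
matrixM = 2 Transpose[centered].centered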


This is a nice application of low-rank factorization by Adaptive Cross Approximation (ACA). We require the first code block from this post, which defines the function ACACompression.

Now we can do this:

npart = 5000;
SeedRandom[1234];
vecs = RandomReal[{-1, 1}, {npart, 3}];

(* Roman's proposal *)
matrixM = With[{diff = Flatten[Outer[Subtract, vecs, vecs, 1], 1]}, 
  (Transpose[diff].diff)/npart
  ]; // AbsoluteTiming // First

r = Transpose[vecs];
matrixMACA = Table[
     Block[{x, y, col, row, U, V},
      x = r[[i]];
      y = r[[j]];
      (* row/column k of the symmetric matrix of summands *)
      col = row = k \[Function] (x[[k]] - x) (y[[k]] - y)/npart;
      {U, V} = ACACompression[row, col];
      (* total of all entries of the factorized matrix, without assembling it *)
      Total[U, {2}].Total[V, {2}]
      ],
     {i, 1, 3}, {j, 1, 3}]; // AbsoluteTiming // First

Max[Abs[matrixMACA - matrixM]]/Max[Abs[matrixM]]

16.8389

0.076707

6.0471*10^-15

Thus, utilizing ACACompression is about 200 times faster for 5000 points, returning the result correctly up to machine precision. For 100000 uniformly distributed points, it needs about 1 second; for 1000000 uniformly distributed points, about 20 seconds (probably due to memory bottlenecks).

The efficiency and accuracy depend quite a lot on the distribution of the points, but one does not need a particularly uniform point cloud to make this work.

When the points are not placed in a remarkably ill-suited way, ACACompression will terminate with an almost exact low-rank factorization of the matrix that contains the summands of the double sum. For $N$ particles, ACACompression needs roughly $O(N)$ time and $O(N)$ memory for the factorization -- in contrast to Outer and the double Sums themselves, which need at least $O(N^2)$ time. More severely, Outer also needs $O(N^2)$ memory. This is why using Outer becomes infeasible quite quickly as $N$ increases.
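
To see why a low-rank factorization must work here (a side check, not part of the original answer): each matrix of summands has the form $A_{kl}=(x_k-x_l)(y_k-y_l)/N=(x_k y_k + x_l y_l - x_k y_l - x_l y_k)/N$, a sum of four outer products of single vectors, so its rank is at most 4 regardless of $N$:

Block[{n = 20, x, y},
 x = RandomReal[{-1, 1}, n];
 y = RandomReal[{-1, 1}, n];
 MatrixRank[Table[(x[[i]] - x[[j]]) (y[[i]] - y[[j]])/n, {i, n}, {j, n}]]]
(* at most 4 *)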