Is it possible to speed up this code by eliminating the "For" loop?
The sum can be split into
$$\sum_{n=1}^{N-m} (x_{n+ m } - x_{n})^2 = \sum_{n=1}^{N-m} x_{n+ m }^2 - 2 \sum_{n=1}^{N-m} x_{n+ m } x_{n} + \sum_{n=1}^{N-m} x_{n}^2.$$
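A quick sanity check of this identity (a throwaway sketch with small, arbitrary values; integer entries keep the comparison exact):
With[{x = RandomInteger[{-9, 9}, 10], m = 3},
 Sum[(x[[n + m]] - x[[n]])^2, {n, 1, 10 - m}] ==
  Sum[x[[n + m]]^2, {n, 1, 10 - m}] -
   2 Sum[x[[n + m]] x[[n]], {n, 1, 10 - m}] +
   Sum[x[[n]]^2, {n, 1, 10 - m}]]
(* True *)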
If I understood correctly, the OP wants to do all the summations for $m=1,\dotsc, N-1$, and $N$ is rather large. So it might be worthwhile to consider strategies other than direct computation.
For example, the two outer sums above are easy to compute. One computes a list of accumulated squares once, in $\Theta(N)$ time:
$$X_1 := x_1^2, \quad X_{i+1} := X_i+ x_{i+1}^2.$$
Then the first sum is just $\sum_{n=1}^{N-m} x_{n+ m }^2 = \sum_{n=m+1}^{N} x_{n}^2 = X_{N} - X_{m}$ and the third is $\sum_{n=1}^{N-m} x_{n}^2 = X_{N-m}$. So the outer sums for all $m$ together can be computed in $\Theta(N)$ time instead of $\Theta(N^2)$.
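A minimal sketch of this precomputation (assuming x already holds the data as a numeric vector):
X = Accumulate[x^2];           (* X[[i]] = x[[1]]^2 + ... + x[[i]]^2 *)
firstSums = X[[-1]] - Most[X]; (* X_N - X_m for m = 1, ..., N-1 *)
thirdSums = Reverse[Most[X]];  (* X_{N-m} for m = 1, ..., N-1 *)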
This would not help much if we were not able to get a hold on the sums $y_m := \sum_{n=1}^{N-m} x_{n+ m } x_{n}$. However, these are of a convolution-like nature. So something tells me that one should be able to speed this up by a clever application of the fast Fourier transform (or related methods such as ListConvolve or ListCorrelate), but I just cannot put my finger on it...
Addendum:
Finally I figured out how to use ListCorrelate. All that was needed was suitable zero padding: with alignment {1, 1} and $N$ zeros of padding, entry $j$ of the correlation equals $y_{j-1}$, so Rest drops the zero-lag term $y_0 = \sum_{n=1}^{N} x_n^2$ and leaves exactly $y_1, \dotsc, y_{N-1}$.
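To see the alignment at work, a tiny symbolic sketch (hypothetical 4-element vector):
ListCorrelate[{a, b, c, d}, {a, b, c, d}, {1, 1}, 0]
(* {a^2 + b^2 + c^2 + d^2, a b + b c + c d, a c + b d, a d} *)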
NN = 1000;
x = RandomReal[{-1, 1}, NN];
(* brute force: O(N^2) scalar sums *)
ytrue = Table[Sum[x[[m + n]] x[[n]], {n, 1, NN - m}], {m, 1, NN - 1}]; // AbsoluteTiming // First
(* vectorized dot products: still O(N^2) operations, but fast in practice *)
yfaster = Table[x[[j ;;]].x[[;; -j]], {j, 2, NN}]; // AbsoluteTiming // First
(* FFT-based correlation: O(N log N) *)
yfastest = Rest@ListCorrelate[x, x, {1, 1}, ConstantArray[0., NN]]; // AbsoluteTiming // First
Max[Abs[ytrue - yfaster]]/Max[Abs[ytrue]]
Max[Abs[ytrue - yfastest]]/Max[Abs[ytrue]]
0.828028       (* timing: ytrue *)
0.002384       (* timing: yfaster *)
0.00013        (* timing: yfastest *)
1.15899*10^-15 (* relative error of yfaster *)
1.21694*10^-15 (* relative error of yfastest *)
Putting things together
The function ftrue does what you want for a single vector x in the straightforward implementation.
ftrue = x \[Function] With[{NN = Length[x]},
   Table[
    1/(NN - m) Sum[(x[[n + m]] - x[[n]])^2, {n, 1, NN - m}],
    {m, 1, NN - 1}
    ]
   ];
Now the tuned version:
ffast = x \[Function] With[{X = Accumulate[x^2], NN = Length[x]},
   Plus[
     (* (X_N - X_m) + X_{N-m}: the two outer sums *)
     Subtract[Rest@Reverse[X] + X[[-1]], Most[X]],
     (* -2 y_m via FFT-based correlation *)
     Rest@ListCorrelate[-2. x, x, {1, 1}, ConstantArray[0., NN]]
     ]/Range[NN - 1, 1, -1] (* divide by N - m *)
   ];
Accuracy and performance comparison:
x = RandomReal[{-1, 1}, 1000];
trueresult = ftrue[x]; // AbsoluteTiming // First
fastresult = ffast[x]; // AbsoluteTiming // First
Max[Abs[trueresult - fastresult]]/Max[Abs[trueresult]]
1.04949        (* timing: ftrue *)
0.002378       (* timing: ffast *)
1.36369*10^-14 (* relative error *)
Superb. Since ffast has nearly linear scaling (thanks to the built-in FFT), we can do the job for a list of one million numbers in two and a quarter seconds:
x = RandomReal[{-1, 1}, 1000000];
fastresult = ffast[x]; // AbsoluteTiming // First
2.2488
That means that we can do the whole job in about four minutes:
tablenumber = 100;
listnumber = 1000000;
t1 = N@RandomInteger[10, {tablenumber, listnumber}];
t3 = Sqrt@Map[ffast, t1]; // AbsoluteTiming // First
233.429
Given your t1, this will exactly reproduce t3 without needing any of the other code in your OP:
t3m = Table[Sqrt[Plus @@ ((#[[-xx ;;]] - #[[1 ;; xx]])^2)]/Sqrt[xx],
{xx, listnumber - 1, 1, -1}] & /@ t1;
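A quick agreement check against t3 from the ffast run above (assuming both were computed from the same t1):
Max[Abs[t3 - t3m]]/Max[Abs[t3]] (* should be at machine-precision level *)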
For tablenumber and listnumber of 10 and 500 respectively, your code takes about 3 minutes on my laptop, while the above takes about a quarter of a second. As the values are increased, this advantage grows.
That said, you're still talking about a lot of work for the desired numbers you mention in the OP.
What exactly is the purpose / end result of this calculation? There is probably a much more direct way of arriving at the result, but few will want to decode the meaning from the code.
Assuming you don't actually need arbitrary precision results, using:
cl = Compile[{{l, _Integer, 1}}, Module[{z, k, ll = Length@l},
    (* note: use {l, _Real, 1} instead if your t1 holds reals, as with N@RandomInteger above *)
    z = ConstantArray[0., ll - 1];
    For[k = 1, k < ll, k++,
     z[[k]] =
      Sqrt[Total[(l[[k + 1 ;; ll]] - l[[;; -(k + 1)]])^2]]/
       Sqrt[ll - k]];
    z], CompilationTarget -> "WVM", RuntimeOptions -> "Speed"];
And then using it as:
result = ParallelMap[cl, t1]
should significantly speed things up. You can try "C" instead of "WVM" as the compilation target if you have the needed ancillaries (a working C compiler), though I doubt this simple construct will net much difference between the two.
Further optimizations give us:
cl2 = Compile[{{l, _Integer, 1}}, Module[{z, ll, k},
    ll = Length@l;
    z = ConstantArray[0., ll - 1];
    For[k = 1, k < ll, k++,
     z[[k]] =
      Sqrt[Plus @@ ((l[[k + 1 ;; ll]] - l[[;; -(k + 1)]])^2)]/
       Sqrt[ll - k]];
    z], CompilationTarget -> "C" (* "WVM" *), RuntimeOptions -> "Speed",
   CompilationOptions -> {"ExpressionOptimization" -> True,
     "InlineCompiledFunctions" -> True,
     "InlineExternalDefinitions" -> True},
   RuntimeAttributes -> {Listable}, Parallelization -> True];
By tweaking the code to avoid callbacks to the kernel in the heavy parts, compiling to C brings significant speedup.
This is used the same way as above, e.g.:
result = ParallelMap[cl2, t1];
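Since cl2 carries RuntimeAttributes -> {Listable} and Parallelization -> True, it should also be possible to apply it to the whole matrix directly and let the compiled runtime do the threading (a sketch, assuming t1 matches the declared integer element type):
result = cl2[t1]; (* threads over the rows of t1 inside the compiled runtime *)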
This routine took ~1 hour (with other work going on simultaneously) on my laptop for a one-million-element list, so your 100 x 1000000 task over 4 parallel kernels should take ~1 day to finish, depending of course on your CPU.
I'll ponder this further, but I don't think there's much more to be had here: you are, after all, calculating a result that has about half a trillion sums at its leaves...