Is it possible to speed up this code by eliminating the "For" loop?
The sum can be split into
$$\sum_{n=1}^{N-m} (x_{n+ m } - x_{n})^2 = \sum_{n=1}^{N-m} x_{n+ m }^2 - 2 \sum_{n=1}^{N-m} x_{n+ m } x_{n} + \sum_{n=1}^{N-m} x_{n}^2.$$
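A quick sanity check of this identity (a throwaway sketch with small, arbitrary values; integer entries keep the comparison exact):
With[{x = RandomInteger[{-9, 9}, 10], m = 3},
 Sum[(x[[n + m]] - x[[n]])^2, {n, 1, 10 - m}] ==
  Sum[x[[n + m]]^2, {n, 1, 10 - m}] -
   2 Sum[x[[n + m]] x[[n]], {n, 1, 10 - m}] +
   Sum[x[[n]]^2, {n, 1, 10 - m}]]
(* True *)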
If I understood correctly, the OP wants to do all the summations for $m=1,\dotsc, N-1$, and $N$ is rather large. So it might be worthwhile to consider strategies other than direct computation.
For example, the two outer sums above are easy to compute. One computes a list of accumulated squares once, in $\Theta(N)$ time:
$$X_1 := x_1^2, \quad X_{i+1} := X_i+ x_{i+1}^2.$$
Then the first sum is just $\sum_{n=1}^{N-m} x_{n+ m }^2 = \sum_{n=m+1}^{N} x_{n}^2 = X_{N} - X_{m}$ and the third is $\sum_{n=1}^{N-m} x_{n}^2 = X_{N-m}$. So the outer sums for all $m$ together can be computed in $\Theta(N)$ time instead of $\Theta(N^2)$.
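A minimal sketch of this precomputation (assuming x already holds the data as a numeric vector):
X = Accumulate[x^2];           (* X[[i]] = x[[1]]^2 + ... + x[[i]]^2 *)
firstSums = X[[-1]] - Most[X]; (* X_N - X_m for m = 1, ..., N-1 *)
thirdSums = Reverse[Most[X]];  (* X_{N-m} for m = 1, ..., N-1 *)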
This would not help much if we were not able to get a hold on the sums $y_m := \sum_{n=1}^{N-m} x_{n+ m } x_{n}$. However, these are of a convolution-like nature. So something tells me that one should be able to speed this up by a clever application of the fast Fourier transform (or related methods such as ListConvolve or ListCorrelate), but I just cannot put my finger on it...
Addendum:
Finally I figured out how to use ListCorrelate. All that was needed was suitable zero padding: with alignment {1, 1} and $N$ zeros of padding, entry $j$ of the correlation equals $y_{j-1}$, so Rest drops the zero-lag term $y_0 = \sum_{n=1}^{N} x_n^2$ and leaves exactly $y_1, \dotsc, y_{N-1}$.
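To see the alignment at work, a tiny symbolic sketch (hypothetical 4-element vector):
ListCorrelate[{a, b, c, d}, {a, b, c, d}, {1, 1}, 0]
(* {a^2 + b^2 + c^2 + d^2, a b + b c + c d, a c + b d, a d} *)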
NN = 1000;
x = RandomReal[{-1, 1}, NN];
(* brute force: O(N^2) scalar sums *)
ytrue = Table[Sum[x[[m + n]] x[[n]], {n, 1, NN - m}], {m, 1, NN - 1}]; // AbsoluteTiming // First
(* vectorized dot products: still O(N^2) operations, but fast in practice *)
yfaster = Table[x[[j ;;]].x[[;; -j]], {j, 2, NN}]; // AbsoluteTiming // First
(* FFT-based correlation: O(N log N) *)
yfastest = Rest@ListCorrelate[x, x, {1, 1}, ConstantArray[0., NN]]; // AbsoluteTiming // First
Max[Abs[ytrue - yfaster]]/Max[Abs[ytrue]]
Max[Abs[ytrue - yfastest]]/Max[Abs[ytrue]]
0.828028       (* timing: ytrue *)
0.002384       (* timing: yfaster *)
0.00013        (* timing: yfastest *)
1.15899*10^-15 (* relative error of yfaster *)
1.21694*10^-15 (* relative error of yfastest *)
Putting things together
The function ftrue does what you want for a single vector x in the straightforward implementation.
ftrue = x \[Function] With[{NN = Length[x]},
   Table[
    1/(NN - m) Sum[(x[[n + m]] - x[[n]])^2, {n, 1, NN - m}],
    {m, 1, NN - 1}
    ]
   ];
Now the tuned version:
ffast = x \[Function] With[{X = Accumulate[x^2], NN = Length[x]},
   Plus[
     (* (X_N - X_m) + X_{N-m}: the two outer sums *)
     Subtract[Rest@Reverse[X] + X[[-1]], Most[X]],
     (* -2 y_m via FFT-based correlation *)
     Rest@ListCorrelate[-2. x, x, {1, 1}, ConstantArray[0., NN]]
     ]/Range[NN - 1, 1, -1] (* divide by N - m *)
   ];
Accuracy and performance comparison:
x = RandomReal[{-1, 1}, 1000];
trueresult = ftrue[x]; // AbsoluteTiming // First
fastresult = ffast[x]; // AbsoluteTiming // First
Max[Abs[trueresult - fastresult]]/Max[Abs[trueresult]]
1.04949        (* timing: ftrue *)
0.002378       (* timing: ffast *)
1.36369*10^-14 (* relative error *)
Superb. Since ffast has nearly linear scaling (thanks to the built-in FFT), we can do the job for a list of one million numbers in two and a quarter seconds:
x = RandomReal[{-1, 1}, 1000000];
fastresult = ffast[x]; // AbsoluteTiming // First
2.2488
That means that we can do the whole job in about four minutes:
tablenumber = 100;
listnumber = 1000000;
t1 = N@RandomInteger[10, {tablenumber, listnumber}];
t3 = Sqrt@Map[ffast, t1]; // AbsoluteTiming // First
233.429
Given your t1, this will exactly reproduce t3 without needing any of the other code in your OP:
t3m = Table[Sqrt[Plus @@ ((#[[-xx ;;]] - #[[1 ;; xx]])^2)]/Sqrt[xx],
{xx, listnumber - 1, 1, -1}] & /@ t1;
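A quick agreement check against t3 from the ffast run above (assuming both were computed from the same t1):
Max[Abs[t3 - t3m]]/Max[Abs[t3]] (* should be at machine-precision level *)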
For tablenumber and listnumber of 10 and 500 respectively, your code takes about 3 minutes on my laptop, while the above takes about a quarter of a second. As the values are increased, this advantage grows.
That said, you're still talking about a lot of work for the desired numbers you mention in the OP.
What exactly is the purpose / end result of this calculation? There is probably a much more direct way of arriving at the result, but few will want to decode the meaning from the code.
Assuming you don't actually need arbitrary precision results, using:
cl = Compile[{{l, _Integer, 1}}, Module[{z, k, ll = Length@l},
    (* note: use {l, _Real, 1} instead if your t1 holds reals, as with N@RandomInteger above *)
    z = ConstantArray[0., ll - 1];
    For[k = 1, k < ll, k++,
     z[[k]] =
      Sqrt[Total[(l[[k + 1 ;; ll]] - l[[;; -(k + 1)]])^2]]/
       Sqrt[ll - k]];
    z], CompilationTarget -> "WVM", RuntimeOptions -> "Speed"];
And then using it as:
result = ParallelMap[cl, t1]
should significantly speed things up. You can try "C" instead of "WVM" as the compilation target if you have the needed ancillaries (a working C compiler), though I doubt this simple construct will net much difference between the two.
Further optimizations give us:
cl2 = Compile[{{l, _Integer, 1}}, Module[{z, ll, k},
    ll = Length@l;
    z = ConstantArray[0., ll - 1];
    For[k = 1, k < ll, k++,
     z[[k]] =
      Sqrt[Plus @@ ((l[[k + 1 ;; ll]] - l[[;; -(k + 1)]])^2)]/
       Sqrt[ll - k]];
    z], CompilationTarget -> "C" (* "WVM" *), RuntimeOptions -> "Speed",
   CompilationOptions -> {"ExpressionOptimization" -> True,
     "InlineCompiledFunctions" -> True,
     "InlineExternalDefinitions" -> True},
   RuntimeAttributes -> {Listable}, Parallelization -> True];
By tweaking the code to avoid callbacks to the kernel in the heavy parts, compiling to C brings significant speedup.
This is used the same way as above, e.g.:
result = ParallelMap[cl2, t1];
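Since cl2 carries RuntimeAttributes -> {Listable} and Parallelization -> True, it should also be possible to apply it to the whole matrix directly and let the compiled runtime do the threading (a sketch, assuming t1 matches the declared integer element type):
result = cl2[t1]; (* threads over the rows of t1 inside the compiled runtime *)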
This routine took ~1 hour (with other work going on simultaneously) on my laptop for a one-million-element list, so your 100 x 1000000 task over 4 parallel kernels should take ~1 day to finish, depending of course on your CPU.
I'll ponder this further, but I don't think there's much more to be had here: you are, after all, calculating a result that has about half a trillion sums at its leaves...