Getting lengths of sublists that sum to more than one
Very good question / problem. Generally, this problem seems to belong to the class of problems where Compile
is the best choice if maximum efficiency is looked for, since it is, by its nature, not a good fit for the Mathematica paradigm of working with lots of data at once. However, your last solution can be, in a somewhat modified form, brought to the same performance level while only partly compiled:
Clear[split, thresholdSemiCompiled];
split = Compile[{{nums, _Real, 1}},
FoldList[If[# > 1, 0, #] &[#1 + #2] &, 0.0, nums],
CompilationTarget -> "C"];
thresholdSemiCompiled[l_List] :=
Flatten@Differences[SparseArray[Unitize[split[l]] - 1]["NonzeroPositions"]]
It turns out that Clip
was a major bottleneck in the compiled split
, and once I replaced it with If
, it became an order of magnitude faster. This one is on par with your fastest fully compiled solution, in terms of performance.
Edit (by Oleksandr R., at Leonid's request)
In version 8, provided that an integer argument is used rather than a pattern, Position
can be compiled down into a call to a new kernel/VM function "Position"
. This allows the above to be implemented in purely compiled code and avoids the need to use SparseArray
in an undocumented and perhaps unintuitive way to get the positions of the zeros without unpacking the input array:
thresholdCompiled = Compile[{{nums, _Real, 1}},
Flatten@Differences@Position[
Unitize@FoldList[If[# > 1, 0, #] &[#1 + #2] &, 0.0, nums],
0
], CompilationTarget -> "C"
];
The performance gained by doing this is, perhaps surprisingly, relatively minor: this version returns timings of about 7/8ths those of the semi-compiled approach, irrespective of compilation to MVM bytecode or to C, probably due to the ability to locate zeros directly rather than having to subtract 1 from the array and find nonzero elements, and by avoiding the need to return a large intermediate array. (Unitize
, Differences
, Position
, and Flatten
are all compilable, but in version 8 they are compiled into simple function calls, so the only situation in which we would expect a performance gain through compilation is when, as here, it allows a larger problem to be dealt with entirely in compiled code.)
Possibly not fastest, but concise and using a simple iteration over the list. Or two iterations, if we do a cleanup step as in the Compile'd variants.
subseqlens[ll_] := Reap[Module[{n = 0, tot = 0.},
Map[(n++; tot += #; If[tot > 1, tot = 0.; Sow[n]; n = 0]) &,
ll]]][[-1, 1]]
subseqlensC = Compile[{{ll, _Real, 1}}, Module[{n = 0, tot = 0., m},
Select[
Map[(n++; tot += #; If[tot > 1, tot = 0.; m = n; n = 0; m, 0]) &,
ll], # != 0 &]]];
subseqlensCC = Compile[{{ll, _Real, 1}}, Module[{n = 0, tot = 0., m},
Select[
Map[(n++; tot += #; If[tot > 1, tot = 0.; m = n; n = 0; m, 0]) &,
ll], # != 0 &]], CompilationTarget -> "C"];
Here is a fairly big example.
In[461]:= biglist = RandomReal[1, 10^6];
Timing[ss1 = subseqlens[biglist];]
Timing[ss2 = subseqlensC[biglist];]
Timing[ss3 = subseqlensCC[biglist];]
ss1 === ss2 === ss3
Out[462]= {2.92, Null}
Out[463]= {0.26, Null}
Out[464]= {0.04, Null}
Out[465]= True
Using Fold
with Reap
and Sow
might be cleaner:
ClearAll[f];
f = If[Total[{##}, Infinity] > 1, Sow[Length[Flatten[{##}]]]; ## &[], {##}] &;
Fold[f, First@list, Rest@list]; // Reap // Last
(* Out[1]= {{3, 2, 3}} *)