Why is the new PositionIndex horribly slow?
First, let me note that I didn't write PositionIndex, so I can't speak to its internals without doing a bit of digging (which I don't have time for at the moment).
I agree performance could be improved in the case where there are many collisions. Let's quantify how bad the situation is, especially since complexity was mentioned!
We'll use the benchmarking tool in GeneralUtilities to plot time as a function of the size of the list:
Needs["GeneralUtilities`"]
myPosIdx[x_] := <|Thread[x[[#[[All, 1]]]] -> #]|> &@
GatherBy[Range@Length@x, x[[#]] &];
BenchmarkPlot[{PositionIndex, myPosIdx}, RandomInteger[100, #] &, 16, "IncludeFits" -> True]
which gives:
While PositionIndex wins for small lists (< 100 elements), it is substantially slower for large lists. It does still appear to be $O(n \log n)$, at least.
Let's draw the random integers from a much larger range (up to 1000000), so that collisions become rare:
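The exact call for this second benchmark is not shown. A plausible version, assuming it simply mirrors the earlier call with the upper bound raised to 1000000, might look like this (the definition of myPosIdx is repeated so the snippet stands alone):

```wolfram
Needs["GeneralUtilities`"]

(* Same definition as above, repeated for self-containment *)
myPosIdx[x_] := <|Thread[x[[#[[All, 1]]]] -> #]|> &@
   GatherBy[Range@Length@x, x[[#]] &];

(* Assumption: identical to the first benchmark except for the integer range *)
BenchmarkPlot[{PositionIndex, myPosIdx}, RandomInteger[1000000, #] &, 16,
 "IncludeFits" -> True]
```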
Things are much better here. We can see that collisions are the main culprit.
Now let's see how the speed for a fixed-size list depends on the number of unique elements:
BenchmarkPlot[{PositionIndex, myPosIdx}, RandomInteger[#, 10^4] &,
2^{3, 4, 5, 6, 7, 8, 9, 10, 11, 12}]
Indeed, we can see that PositionIndex (roughly) gets faster as the number of unique elements grows, whereas myPosIdx gets slower. That makes sense: PositionIndex is probably appending elements to each value in the association, so the fewer collisions there are, the fewer (slow) appends happen. myPosIdx, by contrast, is bottlenecked by the cost of creating each equivalence class (as PositionIndex no doubt would be too, if it were faster). But this is all academic: PositionIndex should be strictly faster than myPosIdx, since it is written in C.
We will fix this.
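To make the "slow appends" guess concrete, here is a minimal sketch of an append-per-element index. It is purely hypothetical: naivePosIdx is a made-up name, and this is not the actual implementation of PositionIndex, only an illustration of the pattern speculated about above. Each Append copies the existing position list, so a key that occurs many times gets progressively more expensive to update.

```wolfram
(* Hypothetical illustration only; NOT how PositionIndex is really implemented *)
naivePosIdx[x_List] := Module[{assoc = <||>, k},
  Do[
   k = x[[i]];
   (* Look up the positions seen so far (default {}), then append i. *)
   (* Append builds a new list each time, so heavily repeated keys are costly. *)
   AssociateTo[assoc, k -> Append[Lookup[assoc, Key[k], {}], i]],
   {i, Length[x]}];
  assoc]

naivePosIdx[{"x", "y", "x"}]
(* <|"x" -> {1, 3}, "y" -> {2}|> *)
```

With few unique elements, almost every step extends an already long list, which would be consistent with the trend in the plot; with many unique elements, most lists stay short.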
rcollyer pointed out in a comment that the new GroupBy may be substituted for GatherBy in Szabolcs's original to produce the desired function:
cleanPosIdx[x_] := GroupBy[Range @ Length @ x, x[[#]] &]
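As a quick sanity check, a small example (the input list is made up for illustration) shows that this definition produces the same association as PositionIndex, with keys in order of first appearance:

```wolfram
cleanPosIdx[x_] := GroupBy[Range @ Length @ x, x[[#]] &]

list = {"x", "y", "x", "z", "y"};

cleanPosIdx[list]
(* <|"x" -> {1, 3}, "y" -> {2, 5}, "z" -> {4}|> *)

cleanPosIdx[list] === PositionIndex[list]
(* True *)
```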
I shall be using this code until PositionIndex receives an enhancement.
Here's an alternative using the "GroupByList" resource function:
pIndex[i_List] := ResourceFunction["GroupByList"][Range @ Length @ i, i]
Comparison:
a = RandomInteger[10, 10^6];
r1 = pIndex[a]; //RepeatedTiming
r2 = myPosIdx[a]; //RepeatedTiming
r3 = PositionIndex[a]; //RepeatedTiming
r1 === r2 === r3
{0.02, Null}
{0.043, Null}
{0.11, Null}
True