Efficiently finding the positions of a large list of targets in another, even larger list
I will use big and small rather than bigList and smallList, for brevity.
As stated by others, if you can select the positions at random in the first place, this will be faster, e.g.:
pos = RandomSample[Range @ Length @ big, 1200];
You can then get the small list with small = big[[pos]].
To carry out the specific operation you describe, the key is not to scan the big list again for every element of small, but rather to find a way to scan it only once.
One of the simplest methods is to use MapIndexed to build a list of replacement rules that yield positions from elements. MapIndexed is easily extendable to array and tensor elements as well. Dispatch can be used to optimize this list of replacement rules.
Example data:
big = Array[{"x", "y", "z"} # &, 60000];
small = RandomSample[big, 1200];
Build the dispatch table:
rls = MapIndexed[Rule, big] // Dispatch;
Find the positions:
pos = small /. rls;
Together both operations take less than a tenth of a second on my machine:
First @ Timing[small /. Dispatch @ MapIndexed[Rule, big];]
0.0966
Confirmation of accuracy:
Extract[big, pos] === small
True
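On version 10 or later, an equivalent approach (a sketch of mine, not part of the original answer) is to build an Association from elements to positions and query it with Lookup:
assoc = AssociationThread[big -> Range @ Length @ big]; (* assumes big is duplicate-free; with duplicate keys only the last position per element would be kept *)
posA = Lookup[assoc, small];
big[[posA]] === small
True
Note that posA holds plain integer positions rather than the {i} position specifications produced by MapIndexed, so big[[posA]] is used in place of Extract.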
Lists with duplicates
The method above assumes that your big list does not have any duplicate elements; that is, there is a one-to-one mapping of elements to positions. If there are duplicates, you will need an additional GatherBy step, as follows.
Data with duplicates:
big = Array[{"x", "y", "z"} Mod[#, 5000] &, 60000];
small = RandomSample[big, 1200];
Build rules, Gather them, and build the Dispatch table:
rls2 =
Dispatch[
#[[1, 1]] -> #[[All, 2, 1]] & /@
Rule ~MapIndexed~ big ~GatherBy~ First
];
Find the positions in big at which the first element of small appears:
First @ small /. rls2
{2659, 7659, 12659, 17659, 22659, 27659, 32659, 37659, 42659, 47659, 52659, 57659}
Instead of GatherBy you could also use Sow and Reap, e.g.:
rls2 = Dispatch @ Reap[MapIndexed[Sow[#2, #] &, big], _, # -> Flatten@#2 &][[2]];
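The two constructions should be interchangeable. As a quick sanity check (my code, with my own names rlsGather and rlsReap), both should yield identical position lists:
rlsGather = Dispatch[#[[1, 1]] -> #[[All, 2, 1]] & /@ (Rule ~MapIndexed~ big ~GatherBy~ First)];
rlsReap = Dispatch @ Reap[MapIndexed[Sow[#2, #] &, big], _, # -> Flatten@#2 &][[2]];
(small /. rlsGather) === (small /. rlsReap)
True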
Alternative method
To provide an alternative approach, you can use DownValues definitions in place of the Dispatch table. I shall illustrate the method with a list containing duplicates, as that is the more interesting case.
Data:
SeedRandom[1]
big = RandomInteger[9, 50];
small = {6, 3, 9, 7, 2};
Build the definitions:
Clear[posfn]
posfn[_] = {};
AppendTo[posfn[#], #2[[1]]] & ~MapIndexed~ big;
Test the function:
posfn /@ small
{{8, 20, 23}, {17, 27, 30, 32, 34, 36, 42}, {37, 43}, {4}, {18, 22, 35, 41, 50}}
Confirmation:
big[[ posfn[3] ]]
{3, 3, 3, 3, 3, 3, 3}
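A small extra check of my own: because of the catch-all definition posfn[_] = {}, querying a value that never occurs in big simply returns an empty list:
posfn[99]
{}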
Prompted by a conversation in the comments with Mr. Wizard, here is a routine I use:
findMultiPosXX[list_, find_, allowBits_: False, skipCands_: True] :=
 Module[{f = DeleteDuplicates[find], o, l, oo, bitmax = 20, cands, dims},
  If[allowBits && Length@f <= bitmax,
   (* bit-map search: pick the positions where BitXor with each target is zero *)
   With[{r = If[Length@(dims = Dimensions@list) == 1, Range@Length@list,
       Array[List, dims]]}, Pick[r, BitXor[list, #], 0] & /@ f],
   If[skipCands,
    (* This is the core explained below *)
    (l = Length@f;
     oo = Ordering[o = Ordering[Join[f, list, f]]];
     Inner[o[[# ;; #2]] &, oo[[;; l]], oo[[-l ;;]], List][[All, 2 ;; -2]] - l),
    (* end of core method *)
    (* otherwise first restrict the search to candidates lying between Min@f and Max@f *)
    (cands =
      SparseArray[
        BitAnd[UnitStep[list - Min@f], UnitStep[Max@f - list]]]["NonzeroPositions"];
     cands[[#]] & /@ (l = Length@f;
       oo = Ordering[o = Ordering[Join[f, Extract[list, cands], f]]];
       Inner[o[[# ;; #2]] &, oo[[;; l]], oo[[-l ;;]], List][[All, 2 ;; -2]] - l))]]];
Use:
test = RandomInteger[10, 20]
findMultiPosXX[test, {5, 10, 8}]
(*
{10, 0, 1, 4, 10, 5, 7, 2, 7, 7, 1, 4, 9, 3, 3, 7, 7, 8, 2, 5}
{{6, 20}, {1, 5}, {18}}
*)
With the default arguments (just the target list and the search list), it can beat GatherBy by an order of magnitude, and Position with Alternatives (which leaves you with only positions but no mapping to values) by a factor of 5 or more.
When searching for scalar values at the top level, setting the third argument to True will use a bit-map search for 20 or fewer search terms, typically improving performance by 4-10X. Also for scalar searches, if one knows the search values span only a small part of the data range, setting the fourth argument to False causes the search space to be narrowed, boosting performance from 2X to over an order of magnitude.
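For instance (my illustration, reusing the scalar test list from above), the bit-map path is enabled with the third argument and returns the same positions as the default call:
findMultiPosXX[test, {5, 10, 8}, True]
(* {{6, 20}, {1, 5}, {18}} *)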
E.g., testing against several of the other answers and against Position with Alternatives, using the OP-specified sizes with test = RandomInteger[1000000, {60000, 3}] and finds = RandomSample[test, 1200], the speed advantages, expressed as time/mytime, were {7.4286, 6.1429, 20.857, 21.143, 419.71}.
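If you want to reproduce that kind of comparison, a rough sketch (my code, timing only the Position-with-Alternatives baseline) could look like this:
test = RandomInteger[1000000, {60000, 3}];
finds = RandomSample[test, 1200];
tMine = First @ AbsoluteTiming[findMultiPosXX[test, finds];];
tPos = First @ AbsoluteTiming[Position[test, Alternatives @@ finds, {1}];];
tPos/tMine  (* speed advantage over Position with Alternatives *)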
An explanation of the core method (when no optional arguments are set):
Let's take an example list:
test = RandomInteger[5, 20]
(* {5, 3, 2, 5, 0, 4, 3, 4, 4, 0, 0, 5, 4, 0, 5, 5, 5, 4, 0, 2} *)
Say we want to look for threes and fives. We'll call that list lookfor, and prepend/append it to the target list:
lookfor = {3, 5}
work = Join[lookfor, test, lookfor]
(*
{3, 5}
{3, 5, 5, 3, 2, 5, 0, 4, 3, 4, 4, 0, 0, 5, 4, 0, 5, 5, 5, 4, 0, 2, 3, 5}
*)
Now, let's use Ordering to get the positions of the elements of work in its sorted incarnation, along with the ordering of that ordering. The latter is just Range[Length[work]] permuted by the ordering list (you can do it that way, but, somewhat surprisingly, applying Ordering to the ordering is faster):
Column[{order = Ordering[work],orderOforder = Ordering[order]}, Left, 2]
(*
{7,12,13,16,21,5,22,1,4,9,23,8,10,11,15,20,2,3,6,14,17,18,19,24}
{8,17,18,9,6,19,1,12,10,13,14,2,3,20,15,4,21,22,23,16,5,7,11,24}
*)
Now, here's the "magic". Take note of the first and last two entries of the permutation (two being the length of our lookfor list):
lengthOflookfor = Length@lookfor;
transposed = Transpose[{orderOforder[[;; lengthOflookfor]],
orderOforder[[-lengthOflookfor ;;]]}]
(* {{8, 11}, {17, 24}} *)
These pairs are in essence nothing more than spans into the first ordering, one for each of our lookfor elements, mapping us to the corresponding position(s), if any, in the work list:
Column[{span = Span @@@ transposed, places = order[[#]] & /@ span,
work[[#]] & /@ places}, Left, 2]
(*
{8;;11,17;;24}
{{1,4,9,23},{2,3,6,14,17,18,19,24}}
{{3,3,3,3},{5,5,5,5,5,5,5,5}}
*)
So we see that by following the ordering from positions 8 to 11 and 17 to 24, we recover the positions of the elements we're interested in from the work list.
All that's left to do is to trim our padded element entries from the result and adjust the positions to get the actual positions from the original list:
Column[{recovered = ArrayPad[#, -1] & /@ places - lengthOflookfor,
test[[#]] & /@ recovered}, Left, 2]
(*
{{2,7},{1,4,12,15,16,17}}
{{3,3},{5,5,5,5,5,5}}
*)
Because Ordering is, in general, very fast, this trick allows us to recover the positions of a large subset of elements (or show non-existence) from the target list very quickly.
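Packaged up, the steps above amount to a small function along these lines (my wrapper, not the author's code; it assumes lookfor contains no duplicates):
orderingPositions[list_, lookfor_] :=
 Module[{n = Length[lookfor], order, oo},
  order = Ordering[Join[lookfor, list, lookfor]];
  oo = Ordering[order];
  (* trim the padded entries from each span and shift back to positions in list *)
  (ArrayPad[order[[#1 ;; #2]], -1] & @@@ Transpose[{oo[[;; n]], oo[[-n ;;]]}]) - n]

orderingPositions[test, {3, 5}]
(* {{2, 7}, {1, 4, 12, 15, 16, 17}} *)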
For comparison, here is the straightforward approach of simply mapping Position over the small list:
bigList = RandomSample[DeleteDuplicates[RandomInteger[{1, 255}, {70000, 3}]], 60000];
smallList = RandomSample[bigList, 1200];
First@AbsoluteTiming[Map[Position[bigList, #] &, smallList]]
This needs less than a second on my machine.