Efficiently finding the positions of a large list of targets in another, even larger list
I will use big and small rather than bigList and smallList, for brevity.
As stated by others, if you can select the positions at random in the first place, this will be faster, e.g.:
pos = RandomSample[Range @ Length @ big, 1200];
You can then get the small list with small = big[[pos]].
To carry out the specific operation you describe, the key is not to scan the big list again for every element of small, but rather to find a way to scan it only once.
One of the simplest methods is to use MapIndexed to build a list of replacement rules that yield positions from elements. MapIndexed is easily extendable to array and tensor elements as well. Dispatch can be used to optimize this list of replacement rules.
Example data:
big = Array[{"x", "y", "z"} # &, 60000];
small = RandomSample[big, 1200];
Build the dispatch table:
rls = MapIndexed[Rule, big] // Dispatch;
Find the positions:
pos = small /. rls;
Together both operations take less than a tenth of a second on my machine:
First @ Timing[small /. Dispatch @ MapIndexed[Rule, big];]
0.0966
Confirmation of accuracy:
Extract[big, pos] === small
True
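On version 10 or later, an equivalent approach (a sketch of mine, not part of the original answer) is to build an Association from elements to positions and query it with Lookup:
assoc = AssociationThread[big -> Range @ Length @ big]; (* assumes big is duplicate-free; with duplicate keys only the last position per element would be kept *)
posA = Lookup[assoc, small];
big[[posA]] === small
True
Note that posA holds plain integer positions rather than the {i} position specifications produced by MapIndexed, so big[[posA]] is used in place of Extract.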
Lists with duplicates
The method above assumes that your big list does not have any duplicate elements; that is, there is a one-to-one mapping of elements to positions. If there are duplicates, you will need an additional GatherBy step, as follows.
Data with duplicates:
big = Array[{"x", "y", "z"} Mod[#, 5000] &, 60000];
small = RandomSample[big, 1200];
Build rules, Gather them, and build the Dispatch table:
rls2 =
Dispatch[
#[[1, 1]] -> #[[All, 2, 1]] & /@
Rule ~MapIndexed~ big ~GatherBy~ First
];
Find the positions in big at which the first element of small appears:
First @ small /. rls2
{2659, 7659, 12659, 17659, 22659, 27659, 32659, 37659, 42659, 47659, 52659, 57659}
Instead of GatherBy you could also use Sow and Reap, e.g.:
rls2 = Dispatch @ Reap[MapIndexed[Sow[#2, #] &, big], _, # -> Flatten@#2 &][[2]];
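The two constructions should be interchangeable. As a quick sanity check (my code, with my own names rlsGather and rlsReap), both should yield identical position lists:
rlsGather = Dispatch[#[[1, 1]] -> #[[All, 2, 1]] & /@ (Rule ~MapIndexed~ big ~GatherBy~ First)];
rlsReap = Dispatch @ Reap[MapIndexed[Sow[#2, #] &, big], _, # -> Flatten@#2 &][[2]];
(small /. rlsGather) === (small /. rlsReap)
True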
Alternative method
To provide an alternative approach, you can use DownValues definitions in place of the Dispatch table. I shall illustrate the method with a list containing duplicates, as that is the more interesting case.
Data:
SeedRandom[1]
big = RandomInteger[9, 50];
small = {6, 3, 9, 7, 2};
Build the definitions:
Clear[posfn]
posfn[_] = {};
AppendTo[posfn[#], #2[[1]]] & ~MapIndexed~ big;
Test the function:
posfn /@ small
{{8, 20, 23}, {17, 27, 30, 32, 34, 36, 42}, {37, 43}, {4}, {18, 22, 35, 41, 50}}
Confirmation:
big[[ posfn[3] ]]
{3, 3, 3, 3, 3, 3, 3}
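A small extra check of my own: because of the catch-all definition posfn[_] = {}, querying a value that never occurs in big simply returns an empty list:
posfn[99]
{}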
Prompted by a conversation in the comments with Mr. Wizard, here is a routine I use:
findMultiPosXX[list_, find_, allowBits_: False, skipCands_: True] :=
 Module[{f = DeleteDuplicates[find], o, l, oo, bitmax = 20, cands, dims},
  If[allowBits && Length@f <= bitmax,
   (* bit-map search: pick the positions where BitXor with each target is zero *)
   With[{r = If[Length@(dims = Dimensions@list) == 1, Range@Length@list,
       Array[List, dims]]}, Pick[r, BitXor[list, #], 0] & /@ f],
   If[skipCands,
    (* This is the core explained below *)
    (l = Length@f;
     oo = Ordering[o = Ordering[Join[f, list, f]]];
     Inner[o[[# ;; #2]] &, oo[[;; l]], oo[[-l ;;]], List][[All, 2 ;; -2]] - l),
    (* end of core method *)
    (* otherwise first restrict the search to candidates lying between Min@f and Max@f *)
    (cands =
      SparseArray[
        BitAnd[UnitStep[list - Min@f], UnitStep[Max@f - list]]]["NonzeroPositions"];
     cands[[#]] & /@ (l = Length@f;
       oo = Ordering[o = Ordering[Join[f, Extract[list, cands], f]]];
       Inner[o[[# ;; #2]] &, oo[[;; l]], oo[[-l ;;]], List][[All, 2 ;; -2]] - l))]]];
Use:
test = RandomInteger[10, 20]
findMultiPosXX[test, {5, 10, 8}]
(*
{10, 0, 1, 4, 10, 5, 7, 2, 7, 7, 1, 4, 9, 3, 3, 7, 7, 8, 2, 5}
{{6, 20}, {1, 5}, {18}}
*)
With the default arguments (just the target list and the search list), it can beat GatherBy by an order of magnitude, and Position with Alternatives (which leaves you with only positions but no mapping to values) by a factor of 5 or more.
When searching for scalar values at the top level, setting the third argument to True will use a bit-map search for 20 or fewer search terms, typically improving performance by 4-10X. Also for scalar searches, if one knows the search values span only a small part of the data range, setting the fourth argument to False causes the search space to be narrowed, boosting performance from 2X to over an order of magnitude.
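For instance (my illustration, reusing the scalar test list from above), the bit-map path is enabled with the third argument and returns the same positions as the default call:
findMultiPosXX[test, {5, 10, 8}, True]
(* {{6, 20}, {1, 5}, {18}} *)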
E.g., testing against several of the other answers and against Position with Alternatives, using the OP-specified sizes with test = RandomInteger[1000000, {60000, 3}] and finds = RandomSample[test, 1200], the speed advantages, expressed as time/mytime, were {7.4286, 6.1429, 20.857, 21.143, 419.71}.
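If you want to reproduce that kind of comparison, a rough sketch (my code, timing only the Position-with-Alternatives baseline) could look like this:
test = RandomInteger[1000000, {60000, 3}];
finds = RandomSample[test, 1200];
tMine = First @ AbsoluteTiming[findMultiPosXX[test, finds];];
tPos = First @ AbsoluteTiming[Position[test, Alternatives @@ finds, {1}];];
tPos/tMine  (* speed advantage over Position with Alternatives *)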
An explanation of the core method (when no optional arguments are set):
Let's take an example list:
test = RandomInteger[5, 20]
(* {5, 3, 2, 5, 0, 4, 3, 4, 4, 0, 0, 5, 4, 0, 5, 5, 5, 4, 0, 2} *)
Say we want to look for threes and fives. We'll call that list lookfor, and prepend/append it to the target list:
lookfor = {3, 5}
work = Join[lookfor, test, lookfor]
(*
{3, 5}
{3, 5, 5, 3, 2, 5, 0, 4, 3, 4, 4, 0, 0, 5, 4, 0, 5, 5, 5, 4, 0, 2, 3, 5}
*)
Now, let's use Ordering to get the positions of the elements of work in its sorted incarnation, along with the ordering of that ordering. The latter is just Range[Length[work]] permuted by the ordering list (you can do it that way, but, somewhat surprisingly, applying Ordering to the ordering is faster):
Column[{order = Ordering[work],orderOforder = Ordering[order]}, Left, 2]
(*
{7,12,13,16,21,5,22,1,4,9,23,8,10,11,15,20,2,3,6,14,17,18,19,24}
{8,17,18,9,6,19,1,12,10,13,14,2,3,20,15,4,21,22,23,16,5,7,11,24}
*)
Now, here's the "magic". Take note of the first and last two entries of the permutation (two being the length of our lookfor list):
lengthOflookfor = Length@lookfor;
transposed = Transpose[{orderOforder[[;; lengthOflookfor]],
orderOforder[[-lengthOflookfor ;;]]}]
(* {{8, 11}, {17, 24}} *)
These pairs are in essence nothing more than spans into the first ordering, one for each of our lookfor elements, mapping us to the corresponding position(s), if any, in the work list:
Column[{span = Span @@@ transposed, places = order[[#]] & /@ span,
work[[#]] & /@ places}, Left, 2]
(*
{8;;11,17;;24}
{{1,4,9,23},{2,3,6,14,17,18,19,24}}
{{3,3,3,3},{5,5,5,5,5,5,5,5}}
*)
So we see that by following the ordering from positions 8 to 11 and 17 to 24, we recover the positions of the elements we're interested in from the work list.
All that's left to do is to trim our padded element entries from the result and adjust the positions to get the actual positions from the original list:
Column[{recovered = ArrayPad[#, -1] & /@ places - lengthOflookfor,
test[[#]] & /@ recovered}, Left, 2]
(*
{{2,7},{1,4,12,15,16,17}}
{{3,3},{5,5,5,5,5,5}}
*)
Because Ordering is, in general, very fast, this trick allows us to recover the positions of a large subset of elements (or show non-existence) from the target list very quickly.
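Packaged up, the steps above amount to a small function along these lines (my wrapper, not the author's code; it assumes lookfor contains no duplicates):
orderingPositions[list_, lookfor_] :=
 Module[{n = Length[lookfor], order, oo},
  order = Ordering[Join[lookfor, list, lookfor]];
  oo = Ordering[order];
  (* trim the padded entries from each span and shift back to positions in list *)
  (ArrayPad[order[[#1 ;; #2]], -1] & @@@ Transpose[{oo[[;; n]], oo[[-n ;;]]}]) - n]

orderingPositions[test, {3, 5}]
(* {{2, 7}, {1, 4, 12, 15, 16, 17}} *)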
For comparison, here is the straightforward approach of simply mapping Position over the small list:
bigList = RandomSample[DeleteDuplicates[RandomInteger[{1, 255}, {70000, 3}]], 60000];
smallList = RandomSample[bigList, 1200];
First@AbsoluteTiming[Map[Position[bigList, #] &, smallList]]
This needs less than a second on my machine.