How to speed up Select with MemberQ for large lists?
This should be pretty snappy:
With[{t = First@Transpose@#1, t2 = Transpose[#2][[2]]},
Pick[#1, Replace[t, Dispatch[Thread[Rule[Intersection[t, t2], True]]], {1}]]] &[list1, list2]
Perhaps not the fastest, but reasonably fast (requires V10+):
Select[Association[Thread[list2[[All, 2]] -> True]] @* First][list1]
A slightly slower version of this to use for earlier versions:
With[{rules = Dispatch[Thread[list2[[All, 2]] -> True]]},
Select[list1, Replace[First[#], rules] &]
]
One can make use of Nearest
for this purpose (I should mention that the speed improvement may only occur in M11.1+). Here is a function that does this:
carl[l1_,l2_] := With[{nf = Nearest[l2[[All, 2]] -> "Index"]},
Pick[
l1,
Unitize[Length /@ nf[l1[[All, 1]], {All, 0}]],
1
]
]
The key idea is the use of the {All, 0}
spec for the NearestFunction
nf
, that only returns a result for exact hits. Let's compare this to @ciao's approach (modified to be a function):
ciao[l1_, l2_] := With[{t = First@Transpose@l1, t2 = Transpose[l2][[2]]},
Pick[
l1,
Replace[t, Dispatch[Thread[Rule[Intersection[t, t2],True]]], {1}]
]
]
data1 = RandomInteger[10^6, {335000,2}];
data2 = RandomInteger[10^6, {122000,4}];
r1 = ciao[data1, data2]; //RepeatedTiming
r2 = carl[data1, data2]; //RepeatedTiming
r1 === r2
{0.41, Null}
{0.17, Null}
True
So, a speed up of about 2.5.
Update
After further thought, it makes sense to apply NearestFunction
to the other data list. So, here is a version that uses this approach:
carl2[l1_, l2_] := With[{nf = Nearest[l1[[All,1]]->"Index"]},
l1[[Union @@ nf[l2[[All, 2]], {All, 0}]]]
]
Here is another speed comparison:
data1 = RandomInteger[10^6, {335000,2}];
data2 = RandomInteger[10^6, {122000,4}];
r1 = ciao[data1, data2]; //RepeatedTiming
r2 = carl[data1, data2]; //RepeatedTiming
r3 = carl2[data1, data2]; //RepeatedTiming
r1 === r2 === r3
{0.37, Null}
{0.17, Null}
{0.10, Null}
True
A nice speed up!