How to check whether a sublist exist in a huge database lists in a fast way?
Cases
is pretty fast. Consider the case where you have 10,000 lists, each with 100 numbers. Cases
can find all lists with the given subsequence in less than 0.05 seconds:
SeedRandom[100]
data = RandomInteger[{0, 100}, {10000, 100}];
checklist = {29, 49, 25, 53, 57, 25, 86};
Cases[data, {___, Sequence @@ checklist, ___}] // RepeatedTiming
{0.049, {{44, 38, 64, 92, 40, 83, 82, ... }}
If you are only looking for the first instance of such a list, then FirstCase
can be used. If you are only looking for the positions, then Position
can be used, and similarly, FirstPosition
can be used if you only need the first sequence that can be found.
In case you need both the position and the list, use this:
pos = FirstPosition[data, {___, Sequence @@ checklist, ___}];
list = Extract[data, pos];
Instead of using SequenceCount
in a loop, locate all possible matches of the checklist with SequencePosition in the flattened dataset in one go:
exist = SequencePosition[Flatten@datatest, checklist] //
Quotient[(# - 1), Last@Dimensions@datatest] & //
AnyTrue[Apply[SameQ]]
Because a false match may occur in the flattened dataset, we only consider matches where the beginning index and the end index of the match lie in the same row.
The Quotient ...
expression computes the row index from the position index for the beginning and the end of each match. The AnyTrue ...
then checks for the presence of a genuine match.
Being intrigued by the question I did some measurements. Below are some benchmarks showing the most performant solution. All tests are done on Macbook Pro i9, 3.2 GHz, 32 Gb Ram.
First, the test data, a list of 10^6 sublists varying between 0 and 100 in lenght:
datatest = Table[Table[RandomInteger[10], {j, RandomInteger[{0, 100}]}], {i, 1000000}];
checklist = {38, 3, 32, 24, 58, 8};
The author's own solution: 21.6039 seconds
Timing[
For[iii = 1, iii <= Length[datatest], iii++,
exist = SequenceCount[datatest[[iii]], checklist];
If[exist != 0,
Print["Found it", iii];
Break[]
]
]
]
Optimizing for
loop with Map
: 19.9 seconds
Timing[
Map[SequenceCount[#, checklist] &, datatest] // Tally
]
Solution by @C. E. 0.263156 seconds (impressive gains)
Timing[
Cases[datatest, {___, Sequence @@ checklist, ___}]
]
Solution by @sakra, 0.216559 seconds (The best so far)
Timing[SequencePosition[Flatten@datatest, checklist] //
Quotient[(# - 1), Last@Dimensions@datatest] & //
AnyTrue[Apply[SameQ]]]