How to check whether a sublist exist in a huge database lists in a fast way?

Cases is pretty fast. Consider the case where you have 10,000 lists, each with 100 numbers. Cases can find all lists with the given subsequence in less than 0.05 seconds:

SeedRandom[100]
data = RandomInteger[{0, 100}, {10000, 100}];
checklist = {29, 49, 25, 53, 57, 25, 86};

Cases[data, {___, Sequence @@ checklist, ___}] // RepeatedTiming

{0.049, {{44, 38, 64, 92, 40, 83, 82, ... }}

If you are only looking for the first instance of such a list, then FirstCase can be used. If you are only looking for the positions, then Position can be used, and similarly, FirstPosition can be used if you only need the first sequence that can be found.

In case you need both the position and the list, use this:

pos = FirstPosition[data, {___, Sequence @@ checklist, ___}];
list = Extract[data, pos];

Instead of using SequenceCount in a loop, locate all possible matches of the checklist with SequencePosition in the flattened dataset in one go:

exist = SequencePosition[Flatten@datatest, checklist] // 
    Quotient[(# - 1), Last@Dimensions@datatest] & // 
    AnyTrue[Apply[SameQ]]

Because a false match may occur in the flattened dataset, we only consider matches where the beginning index and the end index of the match lie in the same row.

The Quotient ... expression computes the row index from the position index for the beginning and the end of each match. The AnyTrue ... then checks for the presence of a genuine match.

Being intrigued by the question I did some measurements. Below are some benchmarks showing the most performant solution. All tests are done on Macbook Pro i9, 3.2 GHz, 32 Gb Ram.

First, the test data, a list of 10^6 sublists varying between 0 and 100 in lenght:

datatest = Table[Table[RandomInteger[10], {j, RandomInteger[{0, 100}]}], {i, 1000000}];
checklist = {38, 3, 32, 24, 58, 8};

The author's own solution: 21.6039 seconds

Timing[
 For[iii = 1, iii <= Length[datatest], iii++, 
  exist = SequenceCount[datatest[[iii]], checklist];
  If[exist != 0,
   Print["Found it", iii];
   Break[]
   ]
  ]
 ]

Optimizing for loop with Map: 19.9 seconds

Timing[
 Map[SequenceCount[#, checklist]  &, datatest] // Tally
 ]

Solution by @C. E. 0.263156 seconds (impressive gains)

Timing[
 Cases[datatest, {___, Sequence @@ checklist, ___}]
 ]

Solution by @sakra, 0.216559 seconds (The best so far)

Timing[SequencePosition[Flatten@datatest, checklist] // 
   Quotient[(# - 1), Last@Dimensions@datatest] & // 
  AnyTrue[Apply[SameQ]]]

How to check whether a sublist exist in a huge database lists in a fast way?

Tags:

List Manipulation

Performance Tuning

Related

Recent Posts