Finding all sublists or substrings of a given list/string

TMTOWTDI applies to both of these problems. Below I present an overview of various approaches I've come across, followed by timing data obtained in 10.4 on Windows 10 (the timing code is available as well, so you can easily rerun the tests on your own machine if you have a different setup). Which solution is best for you depends both on the problem size as well as which ordering of the sublists or substrings you're looking for.

One note up front: several implementations use the 10.0 function Catenate. If you're using an older version, this can be replaced with Join @@ which is a bit slower but probably won't change the overall performance ordering of the solutions.

Sublists

These can be generated very concisely with ReplaceList and an appropriate pattern.
```
sublists1[list_List] := ReplaceList[list, {___, sub___, ___}:>{sub}]
```
To omit empty sublists, simply replace sub___ by sub__.
Alternatively, just generate them explicitly from start and end indices, using Table. You'll need to Catenate the result though, or you'll get the sublists grouped by starting index:
```
sublists2[list_List] := Catenate @ Table[
  list[[i ;; j]], 
  {i, Length@list + 1}, 
  {j, i-1, Length@list}
]
```
To omit empty sublists, let j start from i instead of i-1.
As of 10.1 there is SequenceCases whose Overlaps option can be used to get all the sublists:
```
sublists3[list_List] := SequenceCases[list, {___}, Overlaps -> All]
```
To omit empty sublists, use {__} instead of {___}. Credits for this approach go to RunnyKine.
Instead of using Table we can also construct the index pairs of each sublist from Subsets. However, this only gives one instance of the empty list, so we'll need to add the others manually:
```
sublists4[list_List] := With[{len = Length@list},
  Join[
    ConstantArray[{}, len], 
    Take[list, #] & /@ Subsets[Range@len, 2]
  ]
]
```
To omit empty sublists, ditch Join and ConstantArray and replace 2 with {1,2}. Credits for this approach go to Kuba.
As of 10.4 there's even a built-in for this task, but for some reason it also returns only one copy of the empty list:
```
sublists5[list_List] := ConstantArray[{}, Length@list]~Join~Subsequences[list]
```
To omit empty sublists, ditch Join and ConstantArray and give Subsequences an nspec of {1,Infinity}.
You can also collect the results of using Partition with overlaps and all possible sublist lengths:
```
sublists6[list_List] := Catenate @ Table[
  Partition[list, n, 1], 
  {n, 0, Length@list}
]
```
To omit empty sublists, simply remove the 0,.

When performance is not a concern, the most important distinguishing feature of these is the order of the returned sublists:

list = {1, 2, 3};
sublists1@list (* ReplaceList *)
sublists2@list (* Table *)
sublists3@list (* SequenceCases *)
sublists4@list (* Subsets *)
sublists5@list (* Subsequences *)
sublists6@list (* Partition *)

(* {{}, {1}, {1, 2}, {1, 2, 3}, {}, {2}, {2, 3}, {}, {3}, {}} *)
(* {{}, {1}, {1, 2}, {1, 2, 3}, {}, {2}, {2, 3}, {}, {3}, {}} *)
(* {{1, 2, 3}, {1, 2}, {1}, {}, {2, 3}, {2}, {}, {3}, {}, {}} *)
(* {{}, {}, {}, {}, {1}, {2}, {3}, {1, 2}, {1, 2, 3}, {2, 3}} *)
(* {{}, {}, {}, {}, {1}, {2}, {3}, {1, 2}, {2, 3}, {1, 2, 3}} *)
(* {{}, {}, {}, {}, {1}, {2}, {3}, {1, 2}, {2, 3}, {1, 2, 3}} *)

Things to note: ReplaceList orders them by starting index, followed by length. SequenceCases does the same but takes length from longest to shortest. Subsequences and Partitions have them in order of increasing length first. Somewhat weirdly Subset first has the length-0 and length-1 lists, and then lists the remaining ones in the same order as ReplaceList. Note that while the Table approach here uses the same ordering as ReplaceList this one is most easily adapted to different ordering by changing the limits and order of the iterators.

Finally, timing results. Most notably, SequenceCases is unusably slow for larger lists (so I've omitted it from the longer timing results), but Subsequences also gets much slower than rolling your own implementation:

list = RandomInteger[5, 100];
RepeatedTiming[sublists1[list];] (*ReplaceList*)
RepeatedTiming[sublists2[list];] (*Table*)
RepeatedTiming[sublists3[list];] (*SequenceCases*)
RepeatedTiming[sublists4[list];] (*Subsets*)
RepeatedTiming[sublists5[list];] (*Subsequences*)
RepeatedTiming[sublists6[list];] (*Partition*)
(*
    {0.00513, Null}
    {0.00452, Null}
    {0.018, Null}    <-- nope
    {0.00378, Null}
    {0.00208, Null}
    {0.00187, Null}
*)

list = RandomInteger[5, 1000];
RepeatedTiming[sublists1[list];] (*ReplaceList*)
RepeatedTiming[sublists2[list];] (*Table*)
RepeatedTiming[sublists4[list];] (*Subsets*)
RepeatedTiming[sublists5[list];] (*Subsequences*)
RepeatedTiming[sublists6[list];] (*Partition*)
(*
  {2.81, Null}
  {1.2, Null}
  {1.2, Null}
  {2.3, Null}
  {2.0, Null}
*)

list = RandomInteger[5, 1200];
RepeatedTiming[sublists1[list];] (*ReplaceList*)
RepeatedTiming[sublists2[list];] (*Table*)
RepeatedTiming[sublists4[list];] (*Subsets*)
RepeatedTiming[sublists5[list];] (*Subsequences*)
RepeatedTiming[sublists6[list];] (*Partition*)
(*
  {4.6, Null}
  {2.0, Null}
  {2.0, Null}
  {5.1, Null}
  {6.0, Null}
*)

In summary, the Table solution is the overall winner in terms of flexibility and performance, but it's good to have some other options for a terse quick-and-dirty solution when you can afford to prioritise readability.

Substrings

There are several more possibilities for strings (which usually map to an approach above), because you can usually either use specific string functions or convert the string to a list of characters to reduce it to one of the above implementations. (And unfortunately, it's not always the case that the string-based solution is faster.)

Adapt the above ReplaceList approach by applying Characters to the string first, and then replace {sub} by StringJoin[sub]:
```
substrings1[string_String] := ReplaceList[
  Characters @ string, 
  {___, sub___, ___} :> StringJoin[sub]
]
```
To omit empty strings, use sub__.
Write a similar version with StringReplaceList and StringExpression for pattern. The catch is to anchor the StringExpression, otherwise you'll get horrible amounts of duplicates.
```
substrings2[string_String] := StringReplaceList[
  string, 
  StartOfString ~~ ___ ~~ sub___ ~~ ___ ~~ EndOfString :> sub
]
```
To omit empty strings, use sub__.
The StringReplaceList version can be written more compactly with a RegularExpression:
```
substrings3[string_String] := StringReplaceList[
  string, 
  RegularExpression["^.*(.*).*$"] :> "$1"
]
```
To omit empty strings, use the regex "^.*(.+).*$".

Adapt the above Table approach, similar to 1, by obtaining the characters first:

substrings4[string_String] := Module[{chars = Characters@string},
  Join @@ Table[
    StringJoin @@ chars[[i ;; j]],
    {i, Length@chars + 1},
    {j, i - 1, Length@chars}
  ]
]

To omit empty strings, let j start from i.

Write a similar version which uses string manipulation functions:

substrings5[string_String] :=
 Join @@ Table[
   StringTake[string, {i, j}],
   {i, StringLength@string + 1},
   {j, i - 1, StringLength@string}
 ]

To omit empty strings, let j start from i.

Adapt the above SequenceCases approach using StringCases.
```
substrings6[string_String] := StringCases[string, ___, Overlaps -> All]
```
To omit empty strings, use __. Credits for this approach go to SquareOne.

Adapt Kuba's Subsets solution:

substrings7[string_String] := With[{len = StringLength@string},
  Join[
    ConstantArray["", len], 
    StringTake[string, #] & /@ Subsets[Range@len, 2]
  ]
]

To omit empty strings, ditch Join and ConstantArray and replace 2 with {1,2}.

Adapt the Partition solution using StringPartition, which was added in 10.1 and updated in 10.4:
```
substrings8[string_String] := Catenate@Table[
  StringPartition[string, n, 1], 
  {n, 0, StringLength@string}
]
```
To omit empty strings, remove the 0,.

There is no string pendant to Subsequences at this point.

Let's look at the order of the results:

string = "abc";
substrings1[string](*ReplaceList*)
substrings2[string](*StringReplaceList + StringPattern*)
substrings3[string](*StringReplaceList + RegularExpression*)
substrings4[string](*Table + Characters*)
substrings5[string](*Table + StringTake*)
substrings6[string](*StringCases*)
substrings7[string](*Subsets*)
substrings8[string](*StringPartition*)
(*
  {"", "a", "ab", "abc", "", "b", "bc", "", "c", ""}
  {"", "c", "", "bc", "b", "", "abc", "ab", "a", ""}
  {"", "c", "", "bc", "b", "", "abc", "ab", "a", ""}
  {"", "a", "ab", "abc", "", "b", "bc", "", "c", ""}
  {"", "a", "ab", "abc", "", "b", "bc", "", "c", ""}
  {"abc", "ab", "a", "", "bc", "b", "", "c", "", ""}
  {"", "", "", "", "a", "b", "c", "ab", "abc", "bc"}
  {"", "", "", "", "a", "b", "c", "ab", "bc", "abc"}
*)

Interestingly, StringReplaceList yields the opposite order from ReplaceList (which I believe is due to greedy matching of the prefix), which itself orders them by starting index first, length second. StringCases, like SubsequenceCases does the same but with decreasing length. Subsets still has its funny order of index pairs and the StringPartition solution sorts them by length. Again, remember that the Table approaches can easily be adapted to yield almost any order you want.

As for performance, and it turns out that the StringReplaceList versions are much slower than the other three. Timing them for a 100-character string:

string = StringJoin @ ConstantArray["a", 100];
Timing[substrings1[string];](*ReplaceList*)
Timing[substrings2[string];](*StringReplaceList + StringPattern*)
Timing[substrings3[string];](*StringReplaceList + RegularExpression*)
Timing[substrings4[string];](*Table + Characters*)
Timing[substrings5[string];](*Table + StringTake*)
Timing[substrings6[string];](*StringCases*)
Timing[substrings7[string];](*Subsets*)
Timing[substrings8[string];](*StringPartition*)
(*
    {0., Null}
    {3.84375, Null}
    {5.28125, Null}
    {0.015625, Null}
    {0., Null}
    {0.015625, Null}
    {0., Null}
    {0., Null}
*)

Comparing the others, we find that the string-based Table and Subsets approaches easily outperform the others on large inputs:

string = StringJoin @ ConstantArray["a", 100];
RepeatedTiming[substrings1[string];](*ReplaceList*)
RepeatedTiming[substrings4[string];](*Table + Characters*)
RepeatedTiming[substrings5[string];](*Table + StringTake*)
RepeatedTiming[substrings6[string];](*StringCases*)
RepeatedTiming[substrings7[string];](*Subsets*)
RepeatedTiming[substrings8[string];](*StringPartition*)
(*
    {0.0096, Null}
    {0.0121, Null}
    {0.00421, Null}
    {0.00573, Null}
    {0.00450, Null}
    {0.00612, Null}
*)

string = StringJoin @ ConstantArray["a", 1000];
RepeatedTiming[substrings1[string];](*ReplaceList*)
RepeatedTiming[substrings4[string];](*Table + Characters*)
RepeatedTiming[substrings5[string];](*Table + StringTake*)
RepeatedTiming[substrings6[string];](*StringCases*)
RepeatedTiming[substrings7[string];](*Subsets*)
RepeatedTiming[substrings8[string];](*StringPartition*)
(*
    {5.74, Null}
    {4.818, Null}
    {1.92, Null}
    {4.59, Null}
    {2.000, Null}
    {2.36, Null}
*)

For strings, there is also :

string = "abcd";

StringCases[string, _ ~~ LetterCharacter ..., Overlaps -> All]

or rather (as @Kuba greatly suggested)

StringCases[string, __, Overlaps -> All]

{abcd, abc, ab, a, bcd, bc, b, cd, c, d}

(5 "" are missing but that's not the point here)

It seems almost as fast as substrings1 according to my tests.

f[l_List] := Take[l, #] & /@ Subsets[Range[Length[l]], {1, 2}];

f[s_String] := StringJoin @@@ f[Characters[s]]

string = "abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcab" <> 
   "cabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabca";

a = f[string]; // Timing

{ 0.015600, Null}

Timing[b = substrings1[string];]

{0.015600, Null}

Complement[b, a]

{}

f @ Range @ 10

{{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}, {1, 2}, {1, 2, 
 3}, {1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 5, 6}, {1, 2, 3, 4, 
 5, 6, 7}, {1, 2, 3, 4, 5, 6, 7, 8}, {1, 2, 3, 4, 5, 6, 7, 8, 9}, {1,
  2, 3, 4, 5, 6, 7, 8, 9, 10}, {2, 3}, {2, 3, 4}, {2, 3, 4, 5}, {2, 
 3, 4, 5, 6}, {2, 3, 4, 5, 6, 7}, {2, 3, 4, 5, 6, 7, 8}, {2, 3, 4, 5,
  6, 7, 8, 9}, {2, 3, 4, 5, 6, 7, 8, 9, 10}, {3, 4}, {3, 4, 5}, {3, 
 4, 5, 6}, {3, 4, 5, 6, 7}, {3, 4, 5, 6, 7, 8}, {3, 4, 5, 6, 7, 8, 
 9}, {3, 4, 5, 6, 7, 8, 9, 10}, {4, 5}, {4, 5, 6}, {4, 5, 6, 7}, {4, 
 5, 6, 7, 8}, {4, 5, 6, 7, 8, 9}, {4, 5, 6, 7, 8, 9, 10}, {5, 6}, {5,
 6, 7}, {5, 6, 7, 8}, {5, 6, 7, 8, 9}, {5, 6, 7, 8, 9, 10}, {6, 
 7}, {6, 7, 8}, {6, 7, 8, 9}, {6, 7, 8, 9, 10}, {7, 8}, {7, 8, 
 9}, {7, 8, 9, 10}, {8, 9}, {8, 9, 10}, {9, 10}}

Finding all sublists or substrings of a given list/string

Sublists

Substrings

Tags:

Combinatorics

List Manipulation

String Manipulation

Related

Recent Posts