Removing elements that are subsets of other elements
How about
subsetQ[set1_, set2_] := Intersection[set1, set2] == Sort[set1];
lists = {{a, b, c}, {a, b, d}, {d, e}, {d}, {a}, {a, b}, {f}};
Select[lists, ! Or @@ Table[subsetQ[#, set],
{set, Complement[lists, {#}]}] &]
Let's pick this apart. Select[list, f] selects from list all those elements x for which f[x] is True. Now the Table[..] portion takes one of the elements (represented by #) of the lists and checks to see if it is a subset of each of the other lists. Of course, we don't want to check # itself, hence the Complement[..] business. Here's an illustration where # = {d}:
Complement[lists, {{d}}]
dTest = Table[subsetQ[{d}, set], {set, Complement[lists, {{d}}]}]
(* Out1: {{a}, {f}, {a, b}, {d, e}, {a, b, c}, {a, b, d}} *)
(* Out2: {False, False, False, True, False, True} *)
Now, if any one of these is True, we want to exclude that list. Note that Or @@ dTest returns True if any element of the list dTest is True.
Or @@ dTest
(* Out: True *)
Slap the negator ! in front, and we see that {d} is excluded.
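Concretely, negating the earlier result gives the selector value for {d}:

```mathematica
! Or @@ dTest
(* Out: False *)
```

Since the selector returns False, Select drops {d}.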
If the lists are not ordered, however, my subsetQ wasn't working correctly; the comparison needs to take element order into account. Since Intersection always returns a sorted result, comparing it against Sort[set1] fixes this:
subsetQ[set1_, set2_] := Intersection[set1, set2] == Sort[set1];
lists = {{"test", "1"}, {"test", "1", "1"}, {"test", "1", "2"},
{"test", "1", "3"}, {"test", "1", "4"}, {"test", "1", "5"},
{"test", "2"}, {"test", "2", "1"}, {"test", "2", "2"},
{"test", "2", "3"}};
Select[lists, ! Or @@ Table[subsetQ[#, set], {set,
Complement[lists, {#}]}] &] // InputForm
(* Out: {{"test", "1", "1"}, {"test", "1", "3"}, {"test", "1", "4"},
{"test", "1", "5"}, {"test", "2", "2"}, {"test", "2", "3"}} *)
Is that what you're hoping for?
I don't think this question, at least as I understand it, has been sufficiently addressed.
I have come up with two different methods and since they each have strengths and weaknesses I shall present both, and then a hybrid method that attempts to be general.
These functions are still not optimal, as needless comparisons are made (a set of the same length but different elements cannot contain the first, for example) or operations performed (ideally I would not need DeleteDuplicates).
Subsets
For long lists of short subsets we can look at the problem in reverse. By that I mean we can compute all the subsets (the power set) of a given set and use that information to determine which elements to keep. We can use Sow and the second parameter of Reap to collect (only) those subsets that appear in our master list. Our function will sow each subset (from the master list) to each of its own subsets as tags. We will Scan the list in reverse Sort order so that long sets are sown first.
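As a minimal sketch of the mechanism (my illustration, not part of the original answer), consider a two-element master list. Each set is sown under all of its own subsets as tags; reaping with the master list as the tag list then produces one bucket per set, whose first sown item is the largest set containing it:

```mathematica
slst = {{1, 2}, {1}};  (* already in reverse Sort order: longer sets first *)
Reap[Sow[#, Subsets[#]] & ~Scan~ slst, slst][[2, All, 1, 1]]
(* -> {{1, 2}, {1, 2}}; DeleteDuplicates then leaves {{1, 2}}, eliminating {1} *)
```

The bucket for tag {1} received {1, 2} before {1}, which is why {1} never survives the final part extraction.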
f1[lst_List] := With[{slst = Sort /@ lst}, DeleteDuplicates[
Reap[# ~Sow~ Subsets@# & ~Scan~ Reverse@Sort@slst, slst][[2, All, 1, 1]]
]]
Test:
f1 @ {{2}, {4, 1}, {5, 2}, {1}, {5}, {3, 5, 1}, {0, 3, 5}, {2, 5, 4, 1}, {1, 4, 3}}
{{1, 2, 4, 5}, {1, 3, 5}, {0, 3, 5}, {1, 3, 4}}
This formulation also handles duplicate elements in a single subset, which the next function does not. Subsets and subset elements are not returned in the order they are given; I made the assumption that this is acceptable, but if not I'll address it later.
Bit mask
As sets become longer, generating a power set becomes impractical. One can instead encode the contents of each set as a bit mask, which allows faster comparison than a high-level Intersection etc. (In later versions the rls = . . . code could likely be improved by using ArrayComponents, but I chose not to program blind.)
At this point I make an assumption: there are no duplicate elements within a single subset.
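The underlying subset test can be sketched in isolation (my illustration; the universe and mask helper below are not part of the answer): each element gets one bit, a set becomes an integer, and set1 is a subset of set2 exactly when BitAnd leaves set1's mask unchanged:

```mathematica
universe = {a, b, c, d};
mask[s_] := Total[2^(Position[universe, #][[1, 1]] - 1) & /@ s]

BitAnd[mask[{a, b}], mask[{a, b, c}]] == mask[{a, b}]
(* -> True: {a, b} is a subset of {a, b, c} *)
BitAnd[mask[{a, d}], mask[{a, b, c}]] == mask[{a, d}]
(* -> False: d is missing from {a, b, c} *)
```

f2 applies this same test in bulk: BitAnd threads over a list of masks, and MemberQ checks whether any comparison left the mask unchanged.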
f2[l_List] := Module[{rls, out, test, unique = Union @@ l},
test[a_, b_] := If[BitAnd[a, b]~MemberQ~a, ## &[], a];
rls = Dispatch @ Thread[# -> 2^Range[0, Length@# - 1]] & @ Reverse @ unique;
out = Table[test[#[[i]], #[[i + 1 ;;]]], {i, Length@#}] & @ Total[Sort@l /. rls, {2}];
Pick[unique, #, 1] & /@ IntegerDigits[out, 2, Length @ unique]
]
Test:
f2 @ {{6, 12, 7, 4}, {10, 4, 1}, {12, 3}, {6, 9, 15}, {4, 9, 3, 7, 6, 2, 8, 5, 1, 11, 12, 15}}
{{1, 4, 10}, {1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 15}}
Hybrid
We can now make a hybrid function that selects between these methods based on the input. The test I'll use is very simple, checking only the length of the longest subset. I have not attempted to optimize it.
hybrid[lst_List] := If[2^Max[Length /@ lst] < Length@lst, f1, f2][lst]
If you require the original subset and element order you can use the output from hybrid to extract these based on pattern matching. The fastest method I found is rather convoluted, converting items to strings to keep Pick from looking too deep.
ordered[lst_List] :=
Pick[lst, ToString /@ Sort /@ lst, Alternatives @@ ToString /@ hybrid[lst]]
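For example, on the lists from the question, ordered returns the surviving sets in their original order:

```mathematica
ordered[{{a, b, c}, {a, b, d}, {d, e}, {d}, {a}, {a, b}, {f}}]
(* -> {{a, b, c}, {a, b, d}, {d, e}, {f}} *)
```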
Timings
So what does all this buy you over Mark's far simpler code? Let's find out.
mark[lists_List] := With[{subsetQ = Intersection[##] == Sort[#] &},
Select[lists, ! Or @@ Table[subsetQ[#, set], {set, Complement[lists, {#}]}] &]
]
SetAttributes[timeAvg, HoldFirst]
timeAvg[func_] := Do[If[# > 0.3, Return[#/5^i]] & @@ Timing@Do[func, {5^i}], {i, 0, 15}]
First with long subsets (calls f2):
lst = DeleteDuplicates[RandomSample[Range@200, #] & /@ RandomInteger[{1, 99}, 1000]];
timeAvg @ #[lst] & /@ {mark, hybrid, ordered}
{13.26, 0.0998, 0.1342}
So 133X faster without ordering and 99X faster with.
Now the other data shape (calls f1):
lst = DeleteDuplicates[RandomSample[Range@20, #] & /@ RandomInteger[{1, 5}, 5000]];
timeAvg @ #[lst] & /@ {mark, hybrid, ordered}
{32.651, 0.05244, 0.1122}
Here 622X faster without ordering and 291X with.
I'd say the extra code is worth it, no?
You can use the Bentley-Clarkson-Levine algorithm that is encoded in Internal`ListMin to find the maximal elements of your list. To do so, we need to encode the elements as lists of 0s and 1s, where the 0s indicate the symbols that are included in the list. For your example this would be
{{a,b,c }, {a,b, d }, {d,e }, { d }, {a }, {a,b }, { f}}
{{0,0,0,1,1,1}, {0,0,1,0,1,1}, {1,1,1,0,0,1}, {1,1,1,0,1,1}, {0,1,1,1,1,1}, {0,0,1,1,1,1}, {1,1,1,1,1,0}}
Now, consider {a, b, c} and {a, b}, or:
{a, b, c} -> {0, 0, 0, 1, 1, 1}
{a, b} -> {0, 0, 1, 1, 1, 1}
For this pair, Internal`ListMin will remove {0, 0, 1, 1, 1, 1} because it's "bigger" than {0, 0, 0, 1, 1, 1} (the function is called Internal`ListMin, not Internal`ListMax, after all). Using Internal`ListMin on the list gives:
Internal`ListMin[{{0,0,0,1,1,1}, {0,0,1,0,1,1}, {1,1,1,0,0,1}, {1,1,1,0,1,1}, {0,1,1,1,1,1}, {0,0,1,1,1,1}, {1,1,1,1,1,0}}]
{{0, 0, 0, 1, 1, 1}, {0, 0, 1, 0, 1, 1}, {1, 1, 1, 0, 0, 1}, {1, 1, 1, 1, 1, 0}}
And this corresponds to the original sublists:
Pick[{a, b, c, d, e, f}, #, 0]& /@ %
{{a, b, c}, {a, b, d}, {d, e}, {f}}
in agreement with the expected answer. Putting the above pieces together gives the following function:
PartialMaximum[l : {__List}] := Module[{un, ass, indexed, s},
  un = Union @@ l;                                     (* all distinct symbols *)
  ass = AssociationThread[un, Range @ Length @ un];    (* symbol -> column index *)
  indexed = Replace[l, ass, {2}];
  s = ConstantArray[1, {Length[l], Length[un]}];
  MapIndexed[(s[[First @ #2, #1]] = 0) &, indexed];    (* 0 marks membership *)
  Pick[un, #, 0] & /@ Internal`ListMin @ s
]
Your example again:
PartialMaximum[{{a,b,c},{a,b,d},{d,e},{d},{a},{a,b},{f}}]
{{a, b, c}, {a, b, d}, {d, e}, {f}}
A larger example:
lst = DeleteDuplicates[RandomSample[Range@200,#]&/@RandomInteger[{1,99},1000]];
PartialMaximum[lst]; //AbsoluteTiming
{0.066836, Null}
In my experiments, this is slightly faster than MrWizard's answer.