Removing elements that are subsets of other elements
How about
subsetQ[set1_, set2_] := Intersection[set1, set2] == Sort[set1];
lists = {{a, b, c}, {a, b, d}, {d, e}, {d}, {a}, {a, b}, {f}};
Select[lists, ! Or @@ Table[subsetQ[#, set],
{set, Complement[lists, {#}]}] &]
Let's pick this apart. Select[list, f] selects from list all those elements x for which f[x] is True. Now the Table[..] portion takes one of the elements (represented by #) of the lists and checks to see if it is a subset of each of the other lists. Of course, we don't want to check # itself, hence the Complement[..] business. Here's an illustration where # = {d}:
Complement[lists, {{d}}]
dTest = Table[subsetQ[{d}, set], {set, Complement[lists, {{d}}]}]
(* Out1: {{a}, {f}, {a, b}, {d, e}, {a, b, c}, {a, b, d}} *)
(* Out2: {False, False, False, True, False, True} *)
Now, if any one of these is True, we want to exclude that list. Note that Or @@ dTest returns True if any element of the list dTest is True.
Or @@ dTest
(* Out: True *)
Slap the negator ! in front, and we see that {d} is excluded.
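Concretely, negating the earlier result gives the selector value for {d}:

```mathematica
! Or @@ dTest
(* Out: False *)
```

Since the selector returns False, Select drops {d}.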
If the lists are not ordered, however, my subsetQ wasn't working correctly; the comparison needs to take element order into account. Since Intersection always returns a sorted result, comparing it against Sort[set1] fixes this:
subsetQ[set1_, set2_] := Intersection[set1, set2] == Sort[set1];
lists = {{"test", "1"}, {"test", "1", "1"}, {"test", "1", "2"},
{"test", "1", "3"}, {"test", "1", "4"}, {"test", "1", "5"},
{"test", "2"}, {"test", "2", "1"}, {"test", "2", "2"},
{"test", "2", "3"}};
Select[lists, ! Or @@ Table[subsetQ[#, set], {set,
Complement[lists, {#}]}] &] // InputForm
(* Out: {{"test", "1", "1"}, {"test", "1", "3"}, {"test", "1", "4"},
{"test", "1", "5"}, {"test", "2", "2"}, {"test", "2", "3"}} *)
Is that what you're hoping for?
I don't think this question, at least as I understand it, has been sufficiently addressed.
I have come up with two different methods and since they each have strengths and weaknesses I shall present both, and then a hybrid method that attempts to be general.
These functions are still not optimal, as needless comparisons are made (a set of the same length but different elements cannot contain the first, for example) or operations performed (ideally I would not need DeleteDuplicates).
Subsets
For long lists of short subsets we can look at the problem in reverse. By that I mean we can compute all the subsets (the power set) of a given set and use that information to determine which elements to keep. We can use Sow and the second parameter of Reap to collect (only) those subsets that appear in our master list. Our function will sow each subset (from the master list) to each of its own subsets as tags. We will Scan the list in reverse Sort order so that long sets are sown first.
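As a minimal sketch of the mechanism (my illustration, not part of the original answer), consider a two-element master list. Each set is sown under all of its own subsets as tags; reaping with the master list as the tag list then produces one bucket per set, whose first sown item is the largest set containing it:

```mathematica
slst = {{1, 2}, {1}};  (* already in reverse Sort order: longer sets first *)
Reap[Sow[#, Subsets[#]] & ~Scan~ slst, slst][[2, All, 1, 1]]
(* -> {{1, 2}, {1, 2}}; DeleteDuplicates then leaves {{1, 2}}, eliminating {1} *)
```

The bucket for tag {1} received {1, 2} before {1}, which is why {1} never survives the final part extraction.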
f1[lst_List] := With[{slst = Sort /@ lst}, DeleteDuplicates[
Reap[# ~Sow~ Subsets@# & ~Scan~ Reverse@Sort@slst, slst][[2, All, 1, 1]]
]]
Test:
f1 @ {{2}, {4, 1}, {5, 2}, {1}, {5}, {3, 5, 1}, {0, 3, 5}, {2, 5, 4, 1}, {1, 4, 3}}
{{1, 2, 4, 5}, {1, 3, 5}, {0, 3, 5}, {1, 3, 4}}
This formulation also handles duplicate elements in a single subset, which the next function does not. Subsets and subset elements are not returned in the order they are given; I made the assumption that this is acceptable, but if not I'll address it later.
Bit mask
As sets become longer, generating a power set becomes impractical. One can instead encode the contents of each set as a bit mask, which allows faster comparison than a high-level Intersection etc. (In later versions the rls = . . . code could likely be improved by using ArrayComponents, but I chose not to program blind.)
At this point I make an assumption: there are no duplicate elements within a single subset.
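The underlying subset test can be sketched in isolation (my illustration; the universe and mask helper below are not part of the answer): each element gets one bit, a set becomes an integer, and set1 is a subset of set2 exactly when BitAnd leaves set1's mask unchanged:

```mathematica
universe = {a, b, c, d};
mask[s_] := Total[2^(Position[universe, #][[1, 1]] - 1) & /@ s]

BitAnd[mask[{a, b}], mask[{a, b, c}]] == mask[{a, b}]
(* -> True: {a, b} is a subset of {a, b, c} *)
BitAnd[mask[{a, d}], mask[{a, b, c}]] == mask[{a, d}]
(* -> False: d is missing from {a, b, c} *)
```

f2 applies this same test in bulk: BitAnd threads over a list of masks, and MemberQ checks whether any comparison left the mask unchanged.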
f2[l_List] := Module[{rls, out, test, unique = Union @@ l},
test[a_, b_] := If[BitAnd[a, b]~MemberQ~a, ## &[], a];
rls = Dispatch @ Thread[# -> 2^Range[0, Length@# - 1]] & @ Reverse @ unique;
out = Table[test[#[[i]], #[[i + 1 ;;]]], {i, Length@#}] & @ Total[Sort@l /. rls, {2}];
Pick[unique, #, 1] & /@ IntegerDigits[out, 2, Length @ unique]
]
Test:
f2 @ {{6, 12, 7, 4}, {10, 4, 1}, {12, 3}, {6, 9, 15}, {4, 9, 3, 7, 6, 2, 8, 5, 1, 11, 12, 15}}
{{1, 4, 10}, {1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 15}}
Hybrid
We can now make a hybrid function that selects between these methods based on the input. The test I'll use is very simple, checking only the length of the longest subset. I have not attempted to optimize it.
hybrid[lst_List] := If[2^Max[Length /@ lst] < Length@lst, f1, f2][lst]
If you require the original subset and element order you can use the output from hybrid to extract these based on pattern matching. The fastest method I found is rather convoluted, converting items to strings to keep Pick from looking too deep.
ordered[lst_List] :=
Pick[lst, ToString /@ Sort /@ lst, Alternatives @@ ToString /@ hybrid[lst]]
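For example, on the lists from the question, ordered returns the surviving sets in their original order:

```mathematica
ordered[{{a, b, c}, {a, b, d}, {d, e}, {d}, {a}, {a, b}, {f}}]
(* -> {{a, b, c}, {a, b, d}, {d, e}, {f}} *)
```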
Timings
So what does all this buy you over Mark's far simpler code? Let's find out.
mark[lists_List] := With[{subsetQ = Intersection[##] == Sort[#] &},
Select[lists, ! Or @@ Table[subsetQ[#, set], {set, Complement[lists, {#}]}] &]
]
SetAttributes[timeAvg, HoldFirst]
timeAvg[func_] := Do[If[# > 0.3, Return[#/5^i]] & @@ Timing@Do[func, {5^i}], {i, 0, 15}]
First with long subsets (calls f2):
lst = DeleteDuplicates[RandomSample[Range@200, #] & /@ RandomInteger[{1, 99}, 1000]];
timeAvg @ #[lst] & /@ {mark, hybrid, ordered}
{13.26, 0.0998, 0.1342}
So 133X faster without ordering and 99X faster with.
Now the other data shape (calls f1):
lst = DeleteDuplicates[RandomSample[Range@20, #] & /@ RandomInteger[{1, 5}, 5000]];
timeAvg @ #[lst] & /@ {mark, hybrid, ordered}
{32.651, 0.05244, 0.1122}
Here 622X faster without ordering and 291X with.
I'd say the extra code is worth it, no?
You can use the Bentley-Clarkson-Levine algorithm that is encoded in Internal`ListMin to find the maximal elements of your list. To do so, we need to encode the elements as lists of 0s and 1s, where the 0s indicate the symbols that are included in the list. For your example this would be
{{a,b,c }, {a,b, d }, {d,e }, { d }, {a }, {a,b }, { f}}
{{0,0,0,1,1,1}, {0,0,1,0,1,1}, {1,1,1,0,0,1}, {1,1,1,0,1,1}, {0,1,1,1,1,1}, {0,0,1,1,1,1}, {1,1,1,1,1,0}}
Now, consider {a, b, c} and {a, b}, or:
{a, b, c} -> {0, 0, 0, 1, 1, 1}
{a, b} -> {0, 0, 1, 1, 1, 1}
For this pair, Internal`ListMin will remove {0, 0, 1, 1, 1, 1} because it's "bigger" than {0, 0, 0, 1, 1, 1} (the function is called Internal`ListMin, not Internal`ListMax, after all). Using Internal`ListMin on the list gives:
Internal`ListMin[{{0,0,0,1,1,1}, {0,0,1,0,1,1}, {1,1,1,0,0,1}, {1,1,1,0,1,1}, {0,1,1,1,1,1}, {0,0,1,1,1,1}, {1,1,1,1,1,0}}]
{{0, 0, 0, 1, 1, 1}, {0, 0, 1, 0, 1, 1}, {1, 1, 1, 0, 0, 1}, {1, 1, 1, 1, 1, 0}}
And this corresponds to the original sublists:
Pick[{a, b, c, d, e, f}, #, 0]& /@ %
{{a, b, c}, {a, b, d}, {d, e}, {f}}
in agreement with the expected answer. Putting the above pieces together gives the following function:
PartialMaximum[l : {__List}] := Module[{un, ass, indexed, s},
  un = Union @@ l;                                     (* all distinct symbols *)
  ass = AssociationThread[un, Range @ Length @ un];    (* symbol -> column index *)
  indexed = Replace[l, ass, {2}];
  s = ConstantArray[1, {Length[l], Length[un]}];
  MapIndexed[(s[[First @ #2, #1]] = 0) &, indexed];    (* 0 marks membership *)
  Pick[un, #, 0] & /@ Internal`ListMin @ s
]
Your example again:
PartialMaximum[{{a,b,c},{a,b,d},{d,e},{d},{a},{a,b},{f}}]
{{a, b, c}, {a, b, d}, {d, e}, {f}}
A larger example:
lst = DeleteDuplicates[RandomSample[Range@200,#]&/@RandomInteger[{1,99},1000]];
PartialMaximum[lst]; //AbsoluteTiming
{0.066836, Null}
In my experiments, this is slightly faster than MrWizard's answer.