How to select minimal subsets?
Solution
minimal[sets_] :=
Module[{f},
f[x__] := (f[x, ___] = Sequence[]; {x});
SetAttributes[f, Orderless];
f @@@ Sort @ sets
]
If the original order in the subsets must be retained one may introduce an auxiliary symbol without loss of performance:
minimal2[sets_] :=
Module[{f, g},
f[x__] := (f[x, ___] = True; False);
g[a_] /; f @@ a = Sequence[];
g[a_] := a;
SetAttributes[f, Orderless];
g /@ Sort @ sets
]
Given that many definitions are created during this process a significant amount of time is spent ordering them. By using SetSystemOptions["DefinitionsReordering" -> "None"]
we can eliminate this time, making an already fast function 2X faster.
minimalFast[sets_] :=
Module[{f, g, op = SystemOptions["DefinitionsReordering"]},
g[f[x__]] := (f[x, ___] = 1; {x});
g[1] = Sequence[];
SetAttributes[f, Orderless];
SetSystemOptions["DefinitionsReordering" -> "None"];
# &[
g[f @@ #] & /@ Sort@sets,
SetSystemOptions[op]
]
]
Timings
Using Lenoid's data and top-level function, and Heike's minSubsets
:
randomSets = Table[Range@# ~RandomSample~ RandomInteger@{3, #} & @ 30, {8000}];
(r0 = minimal[randomSets]); // Timing // First
(r1 = minimalFast[randomSets]); // Timing // First
(r2 = selectMinimalHT[randomSets]); // Timing // First
(r3 = minSubsets[randomSets]); // Timing // First
r0 === r1 === Sort /@ r2 === Sort /@ r3
0.234
0.109
1.482
15.257
True
Explanation
An explanation of this code was requested. First an understanding of the basic form of this method is required. Its mechanism is explained in this answer.
What remains is the working of the Orderless
attribute. This is fairly simple in concept but rather tricky in application.
The first property is that arguments are automatically sorted before anything else is done, even before the function sees them: f[2, 1, 3, 4]
becomes f[1, 2, 3, 4]
.
The second property, and the one at the heart of this answer, is that the pattern-matching engine takes into account Orderless
such that MatchQ[f[5, 7, 2], f[7, __]]
is True
, because there is an ordering of 5, 7, 2
that matches 7, __
.
Putting this together with the version 4 UnsortedUnion
function and you have a function that deletes a set if it contains all the elements of a previously seen set.
There is a complication however. The third property of Orderless
is the effect it has on the creation of definitions. Among other things it changes the order in which rules are tried. Normally Mathematica orders DownValues
by specificity. Because 1, ___
is more specific than __
this returns "Match":
ClearAll[f];
f[__] = "Fail"; f[1, ___] = "Match";
f[1, 2, 3]
"Match"
Orderless
changes this behavior:
ClearAll[f];
SetAttributes[f, Orderless];
f[__] = "Fail"; f[1, ___] = "Match";
f[1, 2, 3]
"Fail"
I credit Simon Woods for showing me how to get around this: the definitions made before the attribute is set are still automatically ordered relative to the other DownValues
. Here __
is tried after 1, __
because it is less specific:
ClearAll[f]
f[__] = "Fail";
SetAttributes[f, Orderless]
f[1, ___] = "Match";
f[1, 2, 3]
"Match"
You could do something like
minSubsets[lst_] := DeleteDuplicates[SortBy[lst, Length], Intersection[#1, #2] === Sort[#1] &]
Then for the example in the question you get
lst = {{a, b}, {b, c}, {a, b, c}, {a, b, e}, {a, c, e}, {a, e, d, f}};
minSubsets[lst]
(* out: {{a, b}, {b, c}, {a, c, e}, {a, e, d, f}} *)
Hybrid Mathematica - Java solution
Since the top-level solution from EDIT is still rather slow, here is a Java port of it. To use it, you have to first load the Java reloader into your session.
Code
Having done that, we have to compile this class:
JCompileLoad@"import java.util.*;
public class MinSubsets{
public static Object[] getMinimalSubsets(int[] lsortedflat,
int[] lengths){
int[][] lsorted = new int[lengths.length][];
int ctr = 0;
for(int i=0;i<lengths.length;i++){
lsorted[i] = new int[lengths[i]];
for(int j=0;j<lengths[i];j++){
lsorted[i][j] = lsortedflat[ctr++];
}
}
int[] positions = new int[lsorted.length];
for(int i=0;i<lsorted.length;i++){
positions[i]=i;
}
Map<Integer,Set<Integer>> hash = new HashMap<Integer,Set<Integer>>();
for(int i=0;i<lsorted.length;i++){
for(int elem:lsorted[i] ){
if(!hash.containsKey(elem)){
hash.put(elem,new HashSet<Integer>());
}
hash.get(elem).add(i);
}
}
List<int[]> aux = new ArrayList<int[]>();
for(int i=0;i<lsorted.length;i++){
if(positions[i]==-1) continue;
Set<Integer> containing =
new HashSet<Integer>(hash.get(lsorted[i][0]));
for(int j = 1; j<lsorted[i].length;j++){
containing.retainAll(hash.get(lsorted[i][j]));
}
for(int elem : lsorted[i]){
hash.get(elem).removeAll(containing);
}
for(int pos : containing){
if( pos == i)continue;
positions[pos]=-1;
}
aux.add(lsorted[i]);
}
return aux.toArray();
}
}"
Now, here is the Mathematica part:
ClearAll[getMinSubsets];
getMinSubsets[l : {{__Integer} ..}] :=
With[{sorted = Sort@l},
MinSubsets`getMinimalSubsets[Flatten[sorted ], Length /@ sorted]
];
getMinSubsets[l_List] :=
With[{rules =
Thread[# -> Range[Length[#]]] &[DeleteDuplicates[Flatten[l]]]
},
Map[ Developer`ToPackedArray,
getMinSubsets[l /. Dispatch[rules]]
] /. Dispatch[Reverse[rules, {2}]]
];
The idea is that for integer elements, I send a flattened list of them to Java plus the list of the lengths of subsets, while for general elements I first map unique elements to inetegers, then do the same thing, then map those back.
Tests and benchmarks
For our test example:
getMinSubsets[sets]
(* {{a, b}, {b, c}, {a, c, e}, {a, e, d, f}} *)
Now, the real sample (you will need to load the definiton of selectMinimalHT
below, and also Heike's minSubsets
, for comparison:
(res=getMinSubsets [ randomSets])//Length//AbsoluteTiming
(res1=selectMinimalHT[randomSets ])//Length//AbsoluteTiming
(res2 = minSubsets[randomSets ])//Length//AbsoluteTiming
res==res1==res2
(*
{0.8750000,1177}
{7.4492188,1177}
{63.5615234,1177}
True
*)
Conclusions
Depending on the data (how large are subsets on the average, and how big is a fraction of subsets containing other subsets), the hybrid Java - Mathematica solution can be 10-20 times faster than top-level Mathematica solution, and 50-100 times faster than Heike's one-liner, which I believe is the fastest of other posted solutions (the truth is that her and other posted solutions have quadratic complexity in the size of the subset list, so the larger it is, the more dramatic will be the performance difference).
EDIT The solution of @Mr.Wizard is actually the fastest top-level Mathematica solution, being only 1.5 times slower than this Java one, but also much shorter and more memory efficient END EDIT
This shows once again what can be a successful optimization path: prototype the algorithm in Mathematica first, get the asymptotic complexity right, and then move heavy part to Java.
The Java solution is also memory-hungry, like my Mathematica top-level one (and unlike Heike's solution which is very memory-efficient). So, for truly large lists, one may have to proceed iteratively, and / or also have lots of RAM available.
In any case, this Java solution may be fast enough to process your real sets in realistic time.
Top - level optimized solution using nested hash tables (used in the above Java solution as a prototype)
EDIT Apparently @Mr.Wizard's latest code is much faster than this and also much shorter END EDIT
Since you mentioned that you need to process rather large lists of subsets, I tried to optimize my code. Here is the fastest top-level implementation I was able to come up with:
Clear[selectMinimalHT];
selectMinimalHT[sets_List] :=
Module[{hash, sorted = Transpose[{#, Range@Length@#} &@Sort@sets],
result},
Do[hash[elem] = Unique[], {elem, Union@Flatten@sets}];
Reap[Sow[#, First@#] & /@ sorted, _,
Do[hash[#1][set] = True, {set, #2}] &
];
result =
Reap[Do[
If[sorted[[i]] == {}, Continue[]];
Sow[sorted[[i, 1]]];
With[{containing =
Apply[Intersection,
Map[
With[{sym = hash[#]},
DownValues[sym, Sort -> False][[All, 1, 1, 1]]
] &,
sorted[[i, 1]]
]
]},
Do[
With[{sym = hash[elem]},
If[ValueQ[sym[set]], Unset[sym[set]]]
],
{set, containing},
{elem, First@set}
];
sorted[[containing[[All, 2]]]] = {};
], (* With *)
{i, Length[sorted]}
]
][[2, 1]];
Remove @@ DownValues[hash][[All, 1, 1]];
result
];
This is based on nested hash-tables, which are modified at run-time, but other than that, it is the same algorithm as in my original code. But, using hash-tables allows me to avoid requent copying of large lists, and, more importantly, the rules telling us which subsets are still potentially valid are updated at each step, which wasn't the case for Dispath
-based rules. This allows to at least have a good asymptotic complexity, although perhaps with a large constant factor coming from a large overhead of top-level Mathematica code.
You use is as:
selectMinimalHT[sets]
(* {{a, b}, {b, c}, {a, c, e}, {a, e, d, f}} *)
Here is a more realistic sample:
randomSets =
Table[RandomSample[#,RandomInteger[{3,Length[#]}]]&@Range[100],{50000}];
selectMinimalHT[randomSets]//Short//AbsoluteTiming
(*
{93.8876953,{{1,15,24},<<4703>>,
{14,70,12,9,31,90,18,65,64,92,26,48,84,57,62,1,76,7,2,4,44,67,22}}}
*)
The complexity is approximately n*l
, where n
is the size of the list, and l
is the average size of a subset. Note that this solution becomes quite memory-hungry, so you may want to split your list in chunks and feed those iteratively to it, combining the result with the remainder to obtain a list to be used in a new iteration.
If your subset elements are numbers, the code can be significantly sped up, by, e.g., porting the above algorithm to Java (Mathematica's Compile
won't do since we need hash tables).
Initial moderately fast solution
I think, the following will be reasonably fast (although, perhaps, not the fastest):
Clear[selectMinimal]
selectMinimal[sets_List] :=
With[{rules = Dispatch[Reap[Sow[#, #] & /@ sets, _, Rule][[2]]]},
If[# === {}, {}, First@#] &@
Reap[
NestWhile[
With[{set = Sow@First@#},
Complement[Rest@#, Apply[Intersection, set /. rules]]
] &,
Sort[sets],
# =!= {} &]
][[2]]
];
In your case, you use it as
selectMinimal[sets]
(* {{a, b}, {b, c}, {a, c, e}, {a, e, d, f}} *)
For some larger example, I will generate a large list of random subsets of another list:
randomSets =
Table[RandomSample[#, RandomInteger[{3, Length[#]}]] &@ Range[30], {1000}];
I get then
selectMinimal[randomSets]//Short//AbsoluteTiming
{0.3535156,{{1,15,10},{2,30,11},<<182>>,{22,5,9,4,2,13,24,21,11,10,27},
{27,30,11,5,8,29,28,18,14,15,21}}}