How to select minimal subsets?

Solution

minimal[sets_] :=
  Module[{f},
    f[x__] := (f[x, ___] = Sequence[]; {x});
    SetAttributes[f, Orderless];
    f @@@ Sort @ sets
  ]

If the original order in the subsets must be retained one may introduce an auxiliary symbol without loss of performance:

minimal2[sets_] :=
  Module[{f, g},
    f[x__] := (f[x, ___] = True; False);
    g[a_] /; f @@ a = Sequence[];
    g[a_] := a;
    SetAttributes[f, Orderless];
    g /@ Sort @ sets
  ] 

Given that many definitions are created during this process a significant amount of time is spent ordering them. By using SetSystemOptions["DefinitionsReordering" -> "None"] we can eliminate this time, making an already fast function 2X faster.

minimalFast[sets_] :=
  Module[{f, g, op = SystemOptions["DefinitionsReordering"]},
    g[f[x__]] := (f[x, ___] = 1; {x});
    g[1] = Sequence[];
    SetAttributes[f, Orderless];
    SetSystemOptions["DefinitionsReordering" -> "None"];
    # &[
      g[f @@ #] & /@ Sort@sets,
      SetSystemOptions[op]
    ]
  ]

Timings

Using Lenoid's data and top-level function, and Heike's minSubsets:

randomSets = Table[Range@# ~RandomSample~ RandomInteger@{3, #} & @ 30, {8000}]; 

(r0 = minimal[randomSets]);         // Timing // First

(r1 = minimalFast[randomSets]);     // Timing // First

(r2 = selectMinimalHT[randomSets]); // Timing // First

(r3 = minSubsets[randomSets]);      // Timing // First

r0 === r1 === Sort /@ r2 === Sort /@ r3

0.234

0.109

1.482

15.257

True


Explanation

An explanation of this code was requested. First an understanding of the basic form of this method is required. Its mechanism is explained in this answer.

What remains is the working of the Orderless attribute. This is fairly simple in concept but rather tricky in application.

The first property is that arguments are automatically sorted before anything else is done, even before the function sees them: f[2, 1, 3, 4] becomes f[1, 2, 3, 4].

The second property, and the one at the heart of this answer, is that the pattern-matching engine takes into account Orderless such that MatchQ[f[5, 7, 2], f[7, __]] is True, because there is an ordering of 5, 7, 2 that matches 7, __.

Putting this together with the version 4 UnsortedUnion function and you have a function that deletes a set if it contains all the elements of a previously seen set.


There is a complication however. The third property of Orderless is the effect it has on the creation of definitions. Among other things it changes the order in which rules are tried. Normally Mathematica orders DownValues by specificity. Because 1, ___ is more specific than __ this returns "Match":

ClearAll[f];

f[__] = "Fail"; f[1, ___] = "Match";

f[1, 2, 3]

"Match"

Orderless changes this behavior:

ClearAll[f];

SetAttributes[f, Orderless];

f[__] = "Fail"; f[1, ___] = "Match";

f[1, 2, 3]

"Fail"

I credit Simon Woods for showing me how to get around this: the definitions made before the attribute is set are still automatically ordered relative to the other DownValues. Here __ is tried after 1, __ because it is less specific:

ClearAll[f]

f[__] = "Fail";

SetAttributes[f, Orderless]

f[1, ___] = "Match";

f[1, 2, 3]

"Match"


You could do something like

minSubsets[lst_] := DeleteDuplicates[SortBy[lst, Length], Intersection[#1, #2] === Sort[#1] &]

Then for the example in the question you get

lst = {{a, b}, {b, c}, {a, b, c}, {a, b, e}, {a, c, e}, {a, e, d, f}};

minSubsets[lst]

(* out: {{a, b}, {b, c}, {a, c, e}, {a, e, d, f}} *)

Hybrid Mathematica - Java solution

Since the top-level solution from EDIT is still rather slow, here is a Java port of it. To use it, you have to first load the Java reloader into your session.

Code

Having done that, we have to compile this class:

JCompileLoad@"import java.util.*;

   public class MinSubsets{
      public static Object[] getMinimalSubsets(int[] lsortedflat, 
                    int[] lengths){
          int[][] lsorted = new int[lengths.length][];
          int ctr = 0;
          for(int i=0;i<lengths.length;i++){
             lsorted[i] = new int[lengths[i]];
             for(int j=0;j<lengths[i];j++){
                lsorted[i][j] = lsortedflat[ctr++];
             }
          }
          int[] positions = new int[lsorted.length];
          for(int i=0;i<lsorted.length;i++){
             positions[i]=i;
          }
          Map<Integer,Set<Integer>> hash = new HashMap<Integer,Set<Integer>>();
          for(int i=0;i<lsorted.length;i++){
             for(int elem:lsorted[i] ){
                if(!hash.containsKey(elem)){
                   hash.put(elem,new HashSet<Integer>());
                }
                hash.get(elem).add(i);
             }
          }
          List<int[]> aux = new ArrayList<int[]>();
          for(int i=0;i<lsorted.length;i++){
             if(positions[i]==-1) continue;
             Set<Integer> containing = 
                new HashSet<Integer>(hash.get(lsorted[i][0]));
             for(int j = 1; j<lsorted[i].length;j++){
                containing.retainAll(hash.get(lsorted[i][j]));
             }          
             for(int elem : lsorted[i]){ 
                hash.get(elem).removeAll(containing);
             }          
             for(int pos : containing){             
                if( pos == i)continue;              
                positions[pos]=-1;
             }
             aux.add(lsorted[i]);
          }     
          return aux.toArray(); 
      }
   }"

Now, here is the Mathematica part:

ClearAll[getMinSubsets];
getMinSubsets[l : {{__Integer} ..}] :=
  With[{sorted = Sort@l},
     MinSubsets`getMinimalSubsets[Flatten[sorted ], Length /@ sorted]
  ];

getMinSubsets[l_List] :=
  With[{rules = 
      Thread[# -> Range[Length[#]]] &[DeleteDuplicates[Flatten[l]]]
    },
    Map[ Developer`ToPackedArray,
      getMinSubsets[l /. Dispatch[rules]]
    ] /. Dispatch[Reverse[rules, {2}]]
  ];

The idea is that for integer elements, I send a flattened list of them to Java plus the list of the lengths of subsets, while for general elements I first map unique elements to inetegers, then do the same thing, then map those back.

Tests and benchmarks

For our test example:

getMinSubsets[sets]

(*  {{a, b}, {b, c}, {a, c, e}, {a, e, d, f}}  *)

Now, the real sample (you will need to load the definiton of selectMinimalHT below, and also Heike's minSubsets, for comparison:

(res=getMinSubsets [ randomSets])//Length//AbsoluteTiming
(res1=selectMinimalHT[randomSets ])//Length//AbsoluteTiming
(res2 = minSubsets[randomSets ])//Length//AbsoluteTiming
res==res1==res2

(*
     {0.8750000,1177}
     {7.4492188,1177}
     {63.5615234,1177}
     True
*)

Conclusions

Depending on the data (how large are subsets on the average, and how big is a fraction of subsets containing other subsets), the hybrid Java - Mathematica solution can be 10-20 times faster than top-level Mathematica solution, and 50-100 times faster than Heike's one-liner, which I believe is the fastest of other posted solutions (the truth is that her and other posted solutions have quadratic complexity in the size of the subset list, so the larger it is, the more dramatic will be the performance difference).

EDIT The solution of @Mr.Wizard is actually the fastest top-level Mathematica solution, being only 1.5 times slower than this Java one, but also much shorter and more memory efficient END EDIT

This shows once again what can be a successful optimization path: prototype the algorithm in Mathematica first, get the asymptotic complexity right, and then move heavy part to Java.

The Java solution is also memory-hungry, like my Mathematica top-level one (and unlike Heike's solution which is very memory-efficient). So, for truly large lists, one may have to proceed iteratively, and / or also have lots of RAM available.

In any case, this Java solution may be fast enough to process your real sets in realistic time.

Top - level optimized solution using nested hash tables (used in the above Java solution as a prototype)

EDIT Apparently @Mr.Wizard's latest code is much faster than this and also much shorter END EDIT

Since you mentioned that you need to process rather large lists of subsets, I tried to optimize my code. Here is the fastest top-level implementation I was able to come up with:

Clear[selectMinimalHT];
selectMinimalHT[sets_List] :=
  Module[{hash, sorted = Transpose[{#, Range@Length@#} &@Sort@sets], 
     result},
   Do[hash[elem] = Unique[], {elem, Union@Flatten@sets}];
   Reap[Sow[#, First@#] & /@ sorted, _, 
       Do[hash[#1][set] = True, {set, #2}] &
   ];
   result  = 
     Reap[Do[
        If[sorted[[i]] == {}, Continue[]];
        Sow[sorted[[i, 1]]];
        With[{containing = 
          Apply[Intersection,
            Map[
              With[{sym = hash[#]},
                 DownValues[sym, Sort -> False][[All, 1, 1, 1]]
              ] &,
              sorted[[i, 1]]
            ]
          ]},
          Do[
             With[{sym  = hash[elem]},
               If[ValueQ[sym[set]], Unset[sym[set]]]
             ],
             {set, containing}, 
             {elem, First@set}
          ];
          sorted[[containing[[All, 2]]]] = {};
        ], (* With *)
        {i, Length[sorted]}
      ]
     ][[2, 1]];
     Remove @@ DownValues[hash][[All, 1, 1]];
     result
  ];

This is based on nested hash-tables, which are modified at run-time, but other than that, it is the same algorithm as in my original code. But, using hash-tables allows me to avoid requent copying of large lists, and, more importantly, the rules telling us which subsets are still potentially valid are updated at each step, which wasn't the case for Dispath-based rules. This allows to at least have a good asymptotic complexity, although perhaps with a large constant factor coming from a large overhead of top-level Mathematica code.

You use is as:

selectMinimalHT[sets]

(* {{a, b}, {b, c}, {a, c, e}, {a, e, d, f}} *)

Here is a more realistic sample:

randomSets = 
  Table[RandomSample[#,RandomInteger[{3,Length[#]}]]&@Range[100],{50000}];

selectMinimalHT[randomSets]//Short//AbsoluteTiming

(*
  {93.8876953,{{1,15,24},<<4703>>,
   {14,70,12,9,31,90,18,65,64,92,26,48,84,57,62,1,76,7,2,4,44,67,22}}}
*)

The complexity is approximately n*l, where n is the size of the list, and l is the average size of a subset. Note that this solution becomes quite memory-hungry, so you may want to split your list in chunks and feed those iteratively to it, combining the result with the remainder to obtain a list to be used in a new iteration.

If your subset elements are numbers, the code can be significantly sped up, by, e.g., porting the above algorithm to Java (Mathematica's Compile won't do since we need hash tables).

Initial moderately fast solution

I think, the following will be reasonably fast (although, perhaps, not the fastest):

Clear[selectMinimal]
selectMinimal[sets_List] :=
  With[{rules = Dispatch[Reap[Sow[#, #] & /@ sets, _, Rule][[2]]]},
    If[# === {}, {}, First@#] &@
       Reap[
         NestWhile[
           With[{set  = Sow@First@#},
             Complement[Rest@#, Apply[Intersection, set /. rules]]
           ] &, 
           Sort[sets], 
           # =!= {} &]
       ][[2]]
  ];

In your case, you use it as

selectMinimal[sets]

(*  {{a, b}, {b, c}, {a, c, e}, {a, e, d, f}}  *)

For some larger example, I will generate a large list of random subsets of another list:

randomSets = 
  Table[RandomSample[#, RandomInteger[{3, Length[#]}]] &@ Range[30], {1000}];

I get then

selectMinimal[randomSets]//Short//AbsoluteTiming
{0.3535156,{{1,15,10},{2,30,11},<<182>>,{22,5,9,4,2,13,24,21,11,10,27},   
  {27,30,11,5,8,29,28,18,14,15,21}}}