How to subtract two lists based on a key
We can use Complement to find the elements in set1 whose ID values are not contained within set2:
Complement[set1, set2, SameTest -> (#["ID"] === #2["ID"] &)]
Update for the revised question
Complement provides a nice notational convenience, but for large (15,000-element) sets it runs orders of magnitude slower than the solution proposed in the question. The problem is that Complement is optimized to be very fast when it uses the standard SameTest, but that performance degrades sharply when a user-supplied test is used.
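The slowdown is easy to observe with a small self-contained comparison (the sets and sizes here are illustrative, not from the original benchmark):

(* default SameTest: uses the fast, optimized path *)
Complement[Range[10^3], Range[2, 10^3, 2]] // RepeatedTiming // First

(* user-supplied SameTest: falls back to much slower pairwise comparison *)
Complement[Range[10^3], Range[2, 10^3, 2], SameTest -> (#1 == #2 &)] //
  RepeatedTiming // First

The gap widens rapidly as the sets grow, since the pairwise test scales quadratically.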
We can get a bit of notational improvement while retaining performance by combining the two solutions:
JoinAcross[Complement[set1[All, {"ID"}], set2[All, {"ID"}]], set1, "ID"]
Here we use the standard "high-speed" version of Complement to operate upon the ID fields projected out of the original sets. We then generate the result by joining the complemented keys to the original set.
The original solution could also be simplified a little yet remain performant by incorporating the projection strategy alone, sticking with DeleteCases over Complement:
JoinAcross[
DeleteCases[set1[All, {"ID"}], Alternatives @@ set2[All, {"ID"}]]
, set1, "ID" ]
Datasets Are Slightly Slower
None of these solutions is faster than the original. It is hard to beat the highly-optimized Cases/Alternatives combination (though I look forward to someone posting something faster). However, we can gain around a 10% performance improvement on all of these solutions if we operate upon the raw lists of associations rather than objects wrapped in Dataset. Dataset introduces some minor overheads. Normally, I would not abandon a notational convenience in return for a 10% performance improvement, but the option is there if desired.
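As a sketch, the projection approach can be applied directly to the underlying lists of associations, obtained here with Normal (the names list1 and list2 are mine, not from the original benchmark):

(* unwrap the Dataset objects to plain lists of associations *)
list1 = Normal[set1]; list2 = Normal[set2];

(* same projection strategy, but using Part instead of Dataset queries *)
JoinAcross[
  DeleteCases[list1[[All, {"ID"}]], Alternatives @@ list2[[All, {"ID"}]]],
  list1, "ID"]

The result is a plain list of associations; wrap it in Dataset if the notational convenience is wanted afterwards.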
Imperative Solution
An imperative solution offers the same fast performance again:
Module[{keep}
, keep[_] = True
; Scan[(keep[#ID] = False)&, set2]
; set1[Select[keep[#ID]&]]
]
Even though it is no faster than the other solutions, it retains its speed while offering flexibility should the retention criteria go beyond simple key equality.
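For example, the Select criterion is easy to extend past key membership. Here is a sketch that also filters on a Value field with an illustrative threshold (both the field name and the threshold are hypothetical, not part of the original question):

Module[{keep}
, keep[_] = True
; Scan[(keep[#ID] = False)&, set2]
  (* keep a row only if its key is absent from set2 AND its Value is below 500 *)
; set1[Select[keep[#ID] && #Value < 500 &]]
]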
Performance Disclaimer
All claims of performance in this response are current as of version 12.0.0. Many WL functions have shown wild swings in performance between releases, even by orders of magnitude (both up and down). JoinAcross is one such function. Element access with Association and Dataset has also shown swings, although not as dramatic as those of JoinAcross.
First, let's do it with datasets having unnamed columns, because they are simpler. We use Intersection to find the common elements in the first column. We use a simple replacement rule that replaces a list, which represents a row in the dataset, with Nothing.
{nest1, nest2} = {{{"A", 1}, {"B", 2}, {"C", 3}, {"D", 4}},
{{"A", 111}, {"C", 333}, {"E", 555}}};
{set1, set2} = Dataset /@ {nest1, nest2}
x = Intersection[set1[[All, 1]], set2[[All, 1]]];
set1 = set1 /. ({#, _} -> Nothing & /@ x);
set2 = set2 /. ({#, _} -> Nothing & /@ x);
{set1, set2}
Now do it using datasets with named columns. We use the same syntax for the Intersection, but now the rows of the datasets are represented with associations instead of lists, so the replacement rule is a little different.
{set1, set2} = {{{"A", 1}, {"B", 2}, {"C", 3}, {"D", 4}}, {{"A",
111}, {"C", 333}, {"E", 555}}} //
Query[All, Dataset, AssociationThread[{"ID", "Value"} -> #] &]
With[{x = Intersection[set1[[All, 1]], set2[[All, 1]]]},
set1 = set1 /. (<|"ID" -> #, _|> -> Nothing & /@ x);
set2 = set2 /. (<|"ID" -> #, _|> -> Nothing & /@ x);
]
{set1, set2}
Same output as before.
Speed Comparison
The results of RepeatedTiming indicate that the intersection/replacement method can be 3 times faster than JoinAcross for datasets with named columns in MMA 12.0.0. In the comparison test, the first set has 20000 rows, the second set has 1500 rows, and the number of rows in common is 300. Here's how the test sets are generated:
nrows1 = 20000; nrows2 = 1500; ncomm = 300;
nest1 = Transpose[{
StringJoin /@
RandomChoice[CharacterRange["A", "Z"], {nrows1, 8}],
RandomInteger[{1, 1000}, nrows1]}];
nest2 = RandomSample@Join[
Transpose[{
StringJoin /@
RandomChoice[CharacterRange["A", "Z"], {nrows2 - ncomm, 6}],
RandomInteger[{1, 1000}, nrows2 - ncomm]}],
RandomChoice[nest1, ncomm]];
{set1, set2} = {nest1, nest2} //
Query[All, Dataset, AssociationThread[{"ID", "Value"} -> #] &];
Check the RepeatedTiming results for the JoinAcross method:
JoinAcross[set1,
DeleteCases[set1[All, "ID"], Alternatives @@ set2[All, "ID"]][
All, <|"ID" -> #|> &], "ID"] // RepeatedTiming // First
(* 0.15 *)
Check the RepeatedTiming results for the intersection/replacement method:
With[{x = Intersection[set1[[All, 1]], set2[[All, 1]]]},
With[{rule = <|"ID" -> #, _|> -> Nothing & /@ x},
set1 = set1 /. rule;
]] // RepeatedTiming // First
(* 0.03749 *)
Similar results were obtained with datasets having 200000 rows and with smaller datasets.