How to subtract two lists based on a key
We can use Complement to find the elements in set1 whose ID values are not contained within set2:
Complement[set1, set2, SameTest -> (#["ID"] === #2["ID"] &)]
Update for the revised question
Complement provides a nice notational convenience, but for large (15,000-element) sets it runs orders of magnitude slower than the solution proposed in the question. The problem is that Complement is optimized to be very fast when it uses the standard SameTest, but that performance degrades sharply when a user-supplied test is used.
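The slowdown is easy to observe with a small self-contained comparison (the sets and sizes here are illustrative, not from the original benchmark):

(* default SameTest: uses the fast, optimized path *)
Complement[Range[10^3], Range[2, 10^3, 2]] // RepeatedTiming // First

(* user-supplied SameTest: falls back to much slower pairwise comparison *)
Complement[Range[10^3], Range[2, 10^3, 2], SameTest -> (#1 == #2 &)] //
  RepeatedTiming // First

The gap widens rapidly as the sets grow, since the pairwise test scales quadratically.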
We can get a bit of notational improvement while retaining performance by combining the two solutions:
JoinAcross[Complement[set1[All, {"ID"}], set2[All, {"ID"}]], set1, "ID"]
Here we use the standard "high-speed" version of Complement to operate upon the ID fields projected out of the original sets. We then generate the result by joining the complemented keys to the original set.
The original solution could also be simplified a little yet remain performant by incorporating the projection strategy alone, sticking with DeleteCases over Complement:
JoinAcross[
DeleteCases[set1[All, {"ID"}], Alternatives @@ set2[All, {"ID"}]]
, set1, "ID" ]
Datasets Are Slightly Slower
None of these solutions is faster than the original. It is hard to beat the highly-optimized Cases/Alternatives combination (though I look forward to someone posting something faster). However, we can gain around a 10% performance improvement on all of these solutions if we operate upon the raw lists of associations rather than objects wrapped in Dataset. Dataset introduces some minor overheads. Normally, I would not abandon a notational convenience in return for a 10% performance improvement, but the option is there if desired.
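As a sketch, the projection approach can be applied directly to the underlying lists of associations, obtained here with Normal (the names list1 and list2 are mine, not from the original benchmark):

(* unwrap the Dataset objects to plain lists of associations *)
list1 = Normal[set1]; list2 = Normal[set2];

(* same projection strategy, but using Part instead of Dataset queries *)
JoinAcross[
  DeleteCases[list1[[All, {"ID"}]], Alternatives @@ list2[[All, {"ID"}]]],
  list1, "ID"]

The result is a plain list of associations; wrap it in Dataset if the notational convenience is wanted afterwards.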
Imperative Solution
An imperative solution offers the same fast performance again:
Module[{keep}
, keep[_] = True
; Scan[(keep[#ID] = False)&, set2]
; set1[Select[keep[#ID]&]]
]
Even though it is no faster than the other solutions, it retains its speed while offering flexibility should the retention criteria go beyond simple key equality.
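For example, the Select criterion is easy to extend past key membership. Here is a sketch that also filters on a Value field with an illustrative threshold (both the field name and the threshold are hypothetical, not part of the original question):

Module[{keep}
, keep[_] = True
; Scan[(keep[#ID] = False)&, set2]
  (* keep a row only if its key is absent from set2 AND its Value is below 500 *)
; set1[Select[keep[#ID] && #Value < 500 &]]
]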
Performance Disclaimer
All claims of performance in this response are current as of version 12.0.0. Many WL functions have shown wild swings in performance between releases, even by orders of magnitude (both up and down). JoinAcross is one such function. Element access with Association and Dataset has also shown swings, although not as dramatic as those of JoinAcross.
First, let's do it with datasets having unnamed columns, because they are simpler. We use Intersection to find the common elements in the first column. We use a simple replacement rule that replaces a list, which represents a row in the dataset, with Nothing.
{nest1, nest2} = {{{"A", 1}, {"B", 2}, {"C", 3}, {"D", 4}},
{{"A", 111}, {"C", 333}, {"E", 555}}};
{set1, set2} = Dataset /@ {nest1, nest2}
x = Intersection[set1[[All, 1]], set2[[All, 1]]];
set1 = set1 /. ({#, _} -> Nothing & /@ x);
set2 = set2 /. ({#, _} -> Nothing & /@ x);
{set1, set2}
Now do it using datasets with named columns. We use the same syntax for the Intersection, but now the rows of the datasets are represented with associations instead of lists, so the replacement rule is a little different.
{set1, set2} = {{{"A", 1}, {"B", 2}, {"C", 3}, {"D", 4}}, {{"A",
111}, {"C", 333}, {"E", 555}}} //
Query[All, Dataset, AssociationThread[{"ID", "Value"} -> #] &]
With[{x = Intersection[set1[[All, 1]], set2[[All, 1]]]},
set1 = set1 /. (<|"ID" -> #, _|> -> Nothing & /@ x);
set2 = set2 /. (<|"ID" -> #, _|> -> Nothing & /@ x);
]
{set1, set2}
Same output as before.
Speed Comparison
The results of RepeatedTiming indicate that the intersection/replacement method can be 3 times faster than JoinAcross for datasets with named columns in MMA 12.0.0. In the comparison test, the first set has 20000 rows, the second set has 1500 rows, and the number of rows in common is 300. Here's how the test sets are generated:
nrows1 = 20000; nrows2 = 1500; ncomm = 300;
nest1 = Transpose[{
StringJoin /@
RandomChoice[CharacterRange["A", "Z"], {nrows1, 8}],
RandomInteger[{1, 1000}, nrows1]}];
nest2 = RandomSample@Join[
Transpose[{
StringJoin /@
RandomChoice[CharacterRange["A", "Z"], {nrows2 - ncomm, 6}],
RandomInteger[{1, 1000}, nrows2 - ncomm]}],
RandomChoice[nest1, ncomm]];
{set1, set2} = {nest1, nest2} //
Query[All, Dataset, AssociationThread[{"ID", "Value"} -> #] &];
Check the RepeatedTiming results for the JoinAcross method:
JoinAcross[set1,
DeleteCases[set1[All, "ID"], Alternatives @@ set2[All, "ID"]][
All, <|"ID" -> #|> &], "ID"] // RepeatedTiming // First
(* 0.15 *)
Check the RepeatedTiming results for the intersection/replacement method:
With[{x = Intersection[set1[[All, 1]], set2[[All, 1]]]},
With[{rule = <|"ID" -> #, _|> -> Nothing & /@ x},
set1 = set1 /. rule;
]] // RepeatedTiming // First
(* 0.03749 *)
Similar results were obtained with datasets having 200000 rows and with smaller datasets.