SQL-like computation on Dataset: Aggregate function
It appears that an operation equivalent to SQL GROUP BY is surprisingly awkward to express:
a[GroupBy[Key["A"]] /* Values, <| "A" -> First /* "A", "B" -> Query[Total, "B"] |>]
If the application will tolerate a resulting dataset that is non-rectangular in shape, then the result can be obtained with a more natural query (more natural for Mathematica that is):
a[GroupBy@Key@"A", Total, {"B"}]
Discussion
For purposes of discussion, let's consider a more elaborate input dataset:
$data = Dataset @
{ <| "A" -> "A1", "B" -> 1, "C" -> -1 |>
, <| "A" -> "A1", "B" -> 10, "C" -> -10 |>
, <| "A" -> "A2", "B" -> 2, "C" -> -2 |>
, <| "A" -> "A2", "B" -> 20, "C" -> -20 |>
, <| "A" -> "A2", "B" -> 200, "C" -> -200 |>
, <| "A" -> "A3", "B" -> 3, "C" -> -3 |>
}
We are going to work up to a query that groups by column A and gives us the sums of columns B and C. We start by grouping on A:
$data[
GroupBy[Key["A"]]
]
As we can see, the GroupBy
operator returns a list of associations each with a single key that names the group value. This key is redundant as each group also contains the corresponding A value. We will discard the redundant keys using Values
:
$data[
GroupBy[Key["A"]] /* Values
]
We end up with a list of lists, each sublist containing the associations that comprise a group. We want to end up with one association for each group, identified by its A value. Since every association within a group has the same value for A, we only need to access one of them (here, the first):
$data[
GroupBy[Key["A"]] /* Values
, <| "A" -> First /* "A" |>
]
We also want each result association to have the sum of the B values from the group:
$data[
GroupBy[Key["A"]] /* Values
, <| "A" -> First /* "A"
, "B" -> Query[Total, "B"]
|>
]
We could have written "B" -> Total /* "B"
instead to get the same result. But this seemingly innocent form hides a potential performance problem. It totals all columns first, and only afterwards extracts the total for B. In general, this causes quadratic time complexity which could make the query unacceptably slow for large numbers of columns. The use of Query[Total, "B"]
avoids this pitfall. The difference can be seen using Dataset`ShowPlan
:
Dataset`ShowPlan[Total /* "B"]
(* Total /* GeneralUtilities`Slice["B"] *)
Dataset`ShowPlan[Query[Total, "B"]]
(* Map[GeneralUtilities`Slice["B"]] /* Total *)
For good measure, we wind up by including the sums of the C values as well:
$data[
GroupBy[Key["A"]] /* Values
, <| "A" -> First /* "A"
, "B" -> Query[Total, "B"]
, "C" -> Query[Total, "C"]
|>
]
This is our final result.
Simpler Alternative
If we are willing to bend a little bit on the shape of the final dataset, we can get the desired data using a much simpler query:
$data[GroupBy@Key@"A", Total, {"B", "C"}]
This dataset is no longer "rectangular" like an SQL table but is rather a nested set of associations. If the application can tolerate this different shape, then the reduced conceptual complexity of the query is desirable.
Update: GroupBy@Key?
An astute commentator points out that, according to the documentation, the last query can leave out the explicit call to Key
:
$data[GroupBy["A"], Total, {"B", "C"}]
The same is true for every such usage throughout the post. Unfortunately, a bug in version 10.0.1 prevented this abbreviated form from working properly. Fortunately, the bug is fixed in version 10.0.2. I've left the explicit use of Key
in place just to be on the safe side, but if one is using 10.0.2, then be aware that a shorter form exists.
a = Dataset[{<|"A" -> "A1", "B" -> 2|>, <|"A" -> "A1", "B" -> 3|>}];
Total[a]["B"]
(* 5 *)
In addition to the excellent answer given by WReach I would like to show, how we can come up with the rectangular result the OP wants:
f = Function[ data,
data // RightComposition[
Query[ GroupBy["A"], Total /* KeyDrop["A"] ], (* cf. WReach's answer *)
Query[ KeyValueMap @ Function[ { key, rowdata },
Association[ "A" -> key, rowdata ]
] (* make the data rectangular again *)
]
];
Then
f @ Dataset[{<|"A" -> "A1", "B" -> 2|>, <|"A" -> "A1", "B" -> 3|>}]
(* using the more elaborate data given by WReach *)
$data = Dataset@{<|"A" -> "A1", "B" -> 1, "C" -> -1|>, <|"A" -> "A1",
"B" -> 10, "C" -> -10|>, <|"A" -> "A2", "B" -> 2, "C" -> -2|>, <|
"A" -> "A2", "B" -> 20, "C" -> -20|>, <|"A" -> "A2", "B" -> 200,
"C" -> -200|>, <|"A" -> "A3", "B" -> 3, "C" -> -3|>};
f @ $data
A couple of things to note (from my experience):
- using
Query
makes the solution more general asdataset[ op1, op2,... ]
is limited toDataset
- using
RightComposition
(/*
) allows for very compact code - applying Query sequentially makes the solution more readable maybe
- using
Function
with named vars also makes code more readable