SQL-like computation on Dataset: Aggregate function

It appears that an operation equivalent to SQL GROUP BY is surprisingly awkward to express:

a[GroupBy[Key["A"]] /* Values, <| "A" -> First /* "A", "B" -> Query[Total, "B"] |>]

dataset screenshot

If the application will tolerate a resulting dataset that is non-rectangular in shape, then the result can be obtained with a more natural query (more natural for Mathematica that is):

a[GroupBy@Key@"A", Total, {"B"}]

dataset screenshot


Discussion

For purposes of discussion, let's consider a more elaborate input dataset:

$data = Dataset @
  { <| "A" -> "A1", "B" -> 1, "C" -> -1 |>
  , <| "A" -> "A1", "B" -> 10, "C" -> -10 |>
  , <| "A" -> "A2", "B" -> 2, "C" -> -2 |>
  , <| "A" -> "A2", "B" -> 20, "C" -> -20 |>
  , <| "A" -> "A2", "B" -> 200, "C" -> -200 |>
  , <| "A" -> "A3", "B" -> 3, "C" -> -3 |>
  }

dataset screenshot

We are going to work up to a query that groups by column A and gives us the sums of columns B and C. We start by grouping on A:

$data[
  GroupBy[Key["A"]]
]

dataset screenshot

As we can see, the GroupBy operator returns a list of associations each with a single key that names the group value. This key is redundant as each group also contains the corresponding A value. We will discard the redundant keys using Values:

$data[
  GroupBy[Key["A"]] /* Values
]

dataset screenshot

We end up with a list of lists, each sublist containing the associations that comprise a group. We want to end up with one association for each group, identified by its A value. Since every association within a group has the same value for A, we only need to access one of them (here, the first):

$data[
  GroupBy[Key["A"]] /* Values
, <| "A" -> First /* "A" |>
]

dataset screenshot

We also want each result association to have the sum of the B values from the group:

$data[
  GroupBy[Key["A"]] /* Values
, <| "A" -> First /* "A"
   , "B" -> Query[Total, "B"]
  |>
]

dataset screenshot

We could have written "B" -> Total /* "B" instead to get the same result. But this seemingly innocent form hides a potential performance problem. It totals all columns first, and only afterwards extracts the total for B. In general, this causes quadratic time complexity which could make the query unacceptably slow for large numbers of columns. The use of Query[Total, "B"] avoids this pitfall. The difference can be seen using Dataset`ShowPlan:

Dataset`ShowPlan[Total /* "B"]
(* Total /* GeneralUtilities`Slice["B"] *)

Dataset`ShowPlan[Query[Total, "B"]]
(* Map[GeneralUtilities`Slice["B"]] /* Total *)

For good measure, we wind up by including the sums of the C values as well:

$data[
  GroupBy[Key["A"]] /* Values
, <| "A" -> First /* "A"
   , "B" -> Query[Total, "B"]
   , "C" -> Query[Total, "C"]
  |>
]

dataset screenshot

This is our final result.

Simpler Alternative

If we are willing to bend a little bit on the shape of the final dataset, we can get the desired data using a much simpler query:

$data[GroupBy@Key@"A", Total, {"B", "C"}]

dataset screenshot

This dataset is no longer "rectangular" like an SQL table but is rather a nested set of associations. If the application can tolerate this different shape, then the reduced conceptual complexity of the query is desirable.

Update: GroupBy@Key?

An astute commentator points out that, according to the documentation, the last query can leave out the explicit call to Key:

$data[GroupBy["A"], Total, {"B", "C"}]

The same is true for every such usage throughout the post. Unfortunately, a bug in version 10.0.1 prevented this abbreviated form from working properly. Fortunately, the bug is fixed in version 10.0.2. I've left the explicit use of Key in place just to be on the safe side, but if one is using 10.0.2, then be aware that a shorter form exists.


a = Dataset[{<|"A" -> "A1", "B" -> 2|>, <|"A" -> "A1", "B" -> 3|>}];

Total[a]["B"]
(* 5 *)

In addition to the excellent answer given by WReach I would like to show, how we can come up with the rectangular result the OP wants:

f = Function[ data,

    data // RightComposition[

        Query[ GroupBy["A"], Total /* KeyDrop["A"] ], (* cf. WReach's answer *)
        Query[ KeyValueMap @ Function[ { key, rowdata }, 
            Association[ "A" -> key, rowdata ]
        ] (* make the data rectangular again *)
    ]

];

Then

f @ Dataset[{<|"A" -> "A1", "B" -> 2|>, <|"A" -> "A1", "B" -> 3|>}]

OP data

(* using the more elaborate data given by WReach *)
$data = Dataset@{<|"A" -> "A1", "B" -> 1, "C" -> -1|>, <|"A" -> "A1", 
"B" -> 10, "C" -> -10|>, <|"A" -> "A2", "B" -> 2, "C" -> -2|>, <|
"A" -> "A2", "B" -> 20, "C" -> -20|>, <|"A" -> "A2", "B" -> 200, 
"C" -> -200|>, <|"A" -> "A3", "B" -> 3, "C" -> -3|>};

f @ $data

WReach data

A couple of things to note (from my experience):

  • using Query makes the solution more general as dataset[ op1, op2,... ] is limited to Dataset
  • using RightComposition (/*) allows for very compact code
  • applying Query sequentially makes the solution more readable maybe
  • using Function with named vars also makes code more readable

Tags:

Sql

Dataset