Pandas: aggregate when column contains numpy arrays
Pandas works much more efficiently if you don't do this (e.g. by using numeric data, as you suggest). Another alternative is to use a Panel object for this kind of multidimensional data (in pandas versions that still have it; Panel was deprecated in 0.20 and removed in 0.25).
That said, this looks like a bug. The exception is raised purely because the result is an array:
Exception: Must produce aggregated value
In [11]: %debug
> /Users/234BroadWalk/pandas/pandas/core/groupby.py(1511)_aggregate_named()
1510 if isinstance(output, np.ndarray):
-> 1511 raise Exception('Must produce aggregated value')
1512 result[name] = self._try_cast(output, group)
ipdb> output
array([50, 70, 90])
If you were to recklessly remove those two lines (1510 and 1511) from the source code, it works as expected:
In [99]: g.agg(sum)
Out[99]:
             arraydata
category
1         [50, 70, 90]
2         [20, 30, 40]
Note: They're almost certainly in there for a reason...
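For the numeric-data route mentioned at the top, a minimal sketch might look like the following. The sample frame is made up (chosen so the sums match the output shown above), and the sketch assumes every array in the column has the same length. Each array position becomes its own numeric column, so an ordinary groupby sum applies.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the original DF is not shown in the question.
DF = pd.DataFrame({
    "category": [1, 1, 2],
    "arraydata": [
        np.array([10, 20, 30]),
        np.array([40, 50, 60]),
        np.array([20, 30, 40]),
    ],
})

# Stack the equal-length arrays into a 2-D block: one numeric column per position.
wide = pd.DataFrame(np.vstack(DF["arraydata"].tolist()), index=DF.index)
wide["category"] = DF["category"]

# A plain numeric groupby now does the element-wise sum per category.
summed = wide.groupby("category").sum()
print(summed)
```

If arrays are needed again afterwards, each row of summed can be read back with summed.loc[key].to_numpy().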
One, perhaps more clunky, way to do it would be to iterate over the GroupBy object (it generates (grouping_value, df_subgroup) tuples). For example, to achieve what you want here, you could do:
# Iterating over a GroupBy yields (group_key, sub_DataFrame) pairs.
grouped = DF.groupby("category")
aggregate = [(k, v["arraydata"].sum()) for k, v in grouped]
new_df = pd.DataFrame(aggregate, columns=["category", "arraydata"]).set_index("category")
This is very similar to what pandas is doing under the hood anyways [groupby, then do some aggregation, then merge back in], so you aren't really losing out on much.
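For anyone who wants to run the loop end to end, here is a self-contained sketch. The sample data is hypothetical, and np.vstack plus an axis-0 sum is swapped in for Series.sum() purely to make the element-wise addition explicit; it assumes the arrays within a group share a length.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the original DF is not shown in the question.
DF = pd.DataFrame({
    "category": [1, 1, 2],
    "arraydata": [
        np.array([10, 20, 30]),
        np.array([40, 50, 60]),
        np.array([20, 30, 40]),
    ],
})

rows = []
for key, sub in DF.groupby("category"):
    # Stack this group's arrays into a 2-D block and add them row by row.
    stacked = np.vstack(sub["arraydata"].tolist())
    rows.append((key, stacked.sum(axis=0)))

# Rebuild a frame indexed by category, with one summed array per row.
new_df = pd.DataFrame(rows, columns=["category", "arraydata"]).set_index("category")
print(new_df)
```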
Diving into the Internals
The problem here is that pandas is checking explicitly that the output not be an ndarray because it wants to intelligently reshape your array, as you can see in this snippet from _aggregate_named, where the error occurs.
def _aggregate_named(self, func, *args, **kwargs):
    result = {}
    for name, group in self:
        group.name = name
        output = func(group, *args, **kwargs)
        if isinstance(output, np.ndarray):
            raise Exception('Must produce aggregated value')
        result[name] = self._try_cast(output, group)
    return result
My guess is that this happens because groupby is explicitly set up to try to intelligently put back together a DataFrame with the same indexes and everything aligned nicely. Since it's rare to have nested arrays in a DataFrame like that, it checks for ndarrays to make sure that you are actually using an aggregate function. In my gut, this feels like a job for Panel, but I'm not sure how to transform it perfectly. As an aside, you can sidestep this problem by converting your output to a list, like this:
DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})
Pandas doesn't complain, because the function now returns a plain Python list rather than an ndarray, so the isinstance check passes (but this is really just cheating around the type check). And if you want to convert back to arrays, just apply np.array to the column:
result = DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})
result["arraydata"] = result["arraydata"].apply(np.array)
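To see why the list trick gets past the check, here is a stripped-down stand-in for the loop in _aggregate_named. It is not pandas' real implementation: aggregate_named_sketch, array_sum, and the sample groups are hypothetical names and data used only for illustration, and the _try_cast step is left out.

```python
import numpy as np

def aggregate_named_sketch(groups, func):
    """Simplified stand-in for the _aggregate_named loop (illustration only)."""
    result = {}
    for name, group in groups.items():
        output = func(group)
        # Same type check as in the pandas snippet above.
        if isinstance(output, np.ndarray):
            raise Exception("Must produce aggregated value")
        result[name] = output
    return result

# Hypothetical groups: category -> list of equal-length arrays.
groups = {
    1: [np.array([10, 20, 30]), np.array([40, 50, 60])],
    2: [np.array([20, 30, 40])],
}

def array_sum(group):
    # Element-wise sum of the arrays in one group; returns an ndarray.
    return np.sum(group, axis=0)

# Returning the ndarray directly trips the check.
try:
    aggregate_named_sketch(groups, array_sum)
    raised = False
except Exception:
    raised = True

# Wrapping the result in list() passes, because a list is not an ndarray.
result = aggregate_named_sketch(groups, lambda g: list(array_sum(g)))

# Converting back afterwards mirrors the .apply(np.array) step.
restored = {name: np.array(values) for name, values in result.items()}
```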
How you want to resolve this issue really depends on why you have columns of ndarray and whether you want to aggregate anything else at the same time. That said, you can always iterate over the GroupBy like I've shown above.