How to use groupby transform across multiple columns
The way I read the question, you want to be able to do something arbitrary with both the individual values from both columns. You just need to make sure to return a dataframe of the same size as you get passed in. I think the best way is to just make a new column, like this:
df = pd.DataFrame({'a':[1,2,3,4,5,6],
'b':[1,2,3,4,5,6],
'c':['q', 'q', 'q', 'q', 'w', 'w'],
'd':['z','z','z','o','o','o']})
df['e']=0
def f(x):
y=(x['a']+x['b'])/sum(x['b'])
return pd.DataFrame({'e':y,'a':x['a'],'b':x['b']})
df.groupby(['c','d']).transform(f)
:
a b e
0 1 1 0.333333
1 2 2 0.666667
2 3 3 1.000000
3 4 4 2.000000
4 5 5 0.909091
5 6 6 1.090909
If you have a very complicated dataframe, you can pick your columns (e.g. df.groupby(['c'])['a','b','e'].transform(f)
)
This sure looks very inelegant to me, but it's still much faster than apply
on large datasets.
Another alternative is to use set_index
to capture all the columns you need and then pass just one column to transform
.
Circa Pandas version 0.18, it appears the original answer (below) no longer works.
Instead, if you need to do a groupby computation across multiple columns, do the multi-column computation first, and then the groupby:
df = pd.DataFrame({'a':[1,2,3,4,5,6],
'b':[1,2,3,4,5,6],
'c':['q', 'q', 'q', 'q', 'w', 'w'],
'd':['z','z','z','o','o','o']})
df['e'] = df['a'] + df['b']
df['e'] = (df.groupby(['c', 'd'])['e'].transform('sum'))
print(df)
yields
a b c d e
0 1 1 q z 12
1 2 2 q z 12
2 3 3 q z 12
3 4 4 q o 8
4 5 5 w o 22
5 6 6 w o 22
Original answer:
The error message:
TypeError: cannot concatenate a non-NDFrame object
suggests that in order to concatenate, the foo_function
should return an NDFrame (such as a Series or DataFrame). If you return a Series, then:
In [99]: df.groupby(['c', 'd']).transform(lambda x: pd.Series(np.sum(x['a']+x['b'])))
Out[99]:
a b
0 12 12
1 12 12
2 12 12
3 8 8
4 22 22
5 22 22