How to use groupby transform across multiple columns

The way I read the question, you want to be able to do something arbitrary with both the individual values from both columns. You just need to make sure to return a dataframe of the same size as you get passed in. I think the best way is to just make a new column, like this:

df = pd.DataFrame({'a':[1,2,3,4,5,6],
                   'b':[1,2,3,4,5,6],
                   'c':['q', 'q', 'q', 'q', 'w', 'w'],  
                   'd':['z','z','z','o','o','o']})
df['e']=0

def f(x):
    y=(x['a']+x['b'])/sum(x['b'])
    return pd.DataFrame({'e':y,'a':x['a'],'b':x['b']})

df.groupby(['c','d']).transform(f)

:

    a   b   e
0   1   1   0.333333
1   2   2   0.666667
2   3   3   1.000000
3   4   4   2.000000
4   5   5   0.909091
5   6   6   1.090909

If you have a very complicated dataframe, you can pick your columns (e.g. df.groupby(['c'])['a','b','e'].transform(f))

This sure looks very inelegant to me, but it's still much faster than apply on large datasets.

Another alternative is to use set_index to capture all the columns you need and then pass just one column to transform.


Circa Pandas version 0.18, it appears the original answer (below) no longer works.

Instead, if you need to do a groupby computation across multiple columns, do the multi-column computation first, and then the groupby:

df = pd.DataFrame({'a':[1,2,3,4,5,6],
                   'b':[1,2,3,4,5,6],
                   'c':['q', 'q', 'q', 'q', 'w', 'w'],  
                   'd':['z','z','z','o','o','o']})
df['e'] = df['a'] + df['b']
df['e'] = (df.groupby(['c', 'd'])['e'].transform('sum'))
print(df)

yields

   a  b  c  d   e
0  1  1  q  z  12
1  2  2  q  z  12
2  3  3  q  z  12
3  4  4  q  o   8
4  5  5  w  o  22
5  6  6  w  o  22

Original answer:

The error message:

TypeError: cannot concatenate a non-NDFrame object

suggests that in order to concatenate, the foo_function should return an NDFrame (such as a Series or DataFrame). If you return a Series, then:

In [99]: df.groupby(['c', 'd']).transform(lambda x: pd.Series(np.sum(x['a']+x['b'])))
Out[99]: 
    a   b
0  12  12
1  12  12
2  12  12
3   8   8
4  22  22
5  22  22

Tags:

Python

Pandas