Pandas transform() vs apply()

Just adding another illustrative example with sum as I find it more explicit:

df = (
    pd.DataFrame(pd.np.random.rand(10, 3), columns=['a', 'b', 'c'])
        .assign(a=lambda df: df.a > 0.5)
)

Out[70]: 
       a         b         c
0  False  0.126448  0.487302
1  False  0.615451  0.735246
2  False  0.314604  0.585689
3  False  0.442784  0.626908
4  False  0.706729  0.508398
5  False  0.847688  0.300392
6  False  0.596089  0.414652
7  False  0.039695  0.965996
8   True  0.489024  0.161974
9  False  0.928978  0.332414

df.groupby('a').apply(sum)  # drop rows

         a         b         c
a                             
False  0.0  4.618465  4.956997
True   1.0  0.489024  0.161974


df.groupby('a').transform(sum)  # keep dims

          b         c
0  4.618465  4.956997
1  4.618465  4.956997
2  4.618465  4.956997
3  4.618465  4.956997
4  4.618465  4.956997
5  4.618465  4.956997
6  4.618465  4.956997
7  4.618465  4.956997
8  0.489024  0.161974
9  4.618465  4.956997

However when applied to pd.DataFrame and not pd.GroupBy object I was not able to see any difference.

It looks like SeriesGroupBy.transform() tries to cast the result dtype to the same one as the original column has, but DataFrameGroupBy.transform() doesn't seem to do that:

In [139]: df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
Out[139]:
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    0
8    0
9    1
Name: cat, dtype: int64

#                         v       v
In [140]: df.groupby('id')[['cat']].transform(lambda x: (x == 1).any())
Out[140]:
     cat
0   True
1   True
2   True
3   True
4   True
5   True
6   True
7  False
8  False
9   True

In [141]: df.dtypes
Out[141]:
cat    int64
id     int64
dtype: object

Pandas transform() vs apply()

Tags:

Python

Pandas

Transform

Apply

Related

Recent Posts