Pandas transform() vs apply()
Just adding another illustrative example with sum as I find it more explicit:
import numpy as np
import pandas as pd

df = (
    pd.DataFrame(np.random.rand(10, 3), columns=['a', 'b', 'c'])
    .assign(a=lambda df: df.a > 0.5)  # turn column a into a boolean group key
)
Out[70]:
a b c
0 False 0.126448 0.487302
1 False 0.615451 0.735246
2 False 0.314604 0.585689
3 False 0.442784 0.626908
4 False 0.706729 0.508398
5 False 0.847688 0.300392
6 False 0.596089 0.414652
7 False 0.039695 0.965996
8 True 0.489024 0.161974
9 False 0.928978 0.332414
df.groupby('a').apply(sum)  # reduces to one row per group
a b c
a
False 0.0 4.618465 4.956997
True 1.0 0.489024 0.161974
df.groupby('a').transform(sum)  # keeps one row per original row
b c
0 4.618465 4.956997
1 4.618465 4.956997
2 4.618465 4.956997
3 4.618465 4.956997
4 4.618465 4.956997
5 4.618465 4.956997
6 4.618465 4.956997
7 4.618465 4.956997
8 0.489024 0.161974
9 4.618465 4.956997
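A small deterministic version of the same idea, as a sketch with made-up values (the column name b_share is just something I picked for illustration). Because transform returns one value per original row, its result lines up with the original frame and can be used in row-wise arithmetic, which the reduced result cannot do directly:

```python
import pandas as pd

# Made-up data: 'a' is the group key, 'b' the value column.
df = pd.DataFrame({"a": [False, False, True, False], "b": [1.0, 2.0, 3.0, 5.0]})

# Reduction: one value per group, indexed by the group key.
group_totals = df.groupby("a")["b"].sum()

# Transformation: the group total repeated for every row of that group,
# indexed like the original frame.
row_totals = df.groupby("a")["b"].transform("sum")

# Same index as df, so it can be combined with the original column.
df["b_share"] = df["b"] / row_totals
```

Here group_totals has two entries (one per group), while row_totals has four (one per row).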
However, when applied to a pd.DataFrame rather than a pd.GroupBy object, I was not able to see any difference.
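For what it's worth, one difference on a plain DataFrame does show up once the function reduces instead of mapping element-wise. The sketch below uses made-up data and assumes a reasonably recent pandas, where DataFrame.transform is documented to raise ValueError if the result does not have the same length as the input:

```python
import pandas as pd

df = pd.DataFrame({"b": [1.0, 2.0, 3.0], "c": [4.0, 5.0, 6.0]})

# Element-wise function: apply and transform give the same frame.
same = df.apply(lambda col: col * 2).equals(df.transform(lambda col: col * 2))

# Reducing function: apply is happy to return one value per column...
reduced = df.apply(lambda col: col.sum())

# ...but transform refuses, because the output is not shaped like the input.
try:
    df.transform(lambda col: col.sum())
    transform_raised = False
except ValueError:
    transform_raised = True
```

So with shape-preserving functions the two are indistinguishable, which probably explains why no difference is visible in simple cases.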
It looks like SeriesGroupBy.transform() tries to cast the result back to the dtype of the original column, whereas DataFrameGroupBy.transform() doesn't seem to do that:
In [139]: df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
Out[139]:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 0
8 0
9 1
Name: cat, dtype: int64
# Note the double brackets below: [['cat']] selects a DataFrame, ['cat'] a Series.
In [140]: df.groupby('id')[['cat']].transform(lambda x: (x == 1).any())
Out[140]:
cat
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
In [141]: df.dtypes
Out[141]:
cat int64
id int64
dtype: object
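To try this yourself, here is a minimal sketch with made-up data (the id and cat values are my own, not the frame from the session above, and has_one is a hypothetical helper name). Whether the Series path casts back to int64 may well depend on the pandas version, so the snippet only prints the dtypes instead of assuming them:

```python
import pandas as pd

# Made-up data: group 1 contains a 1 in 'cat', group 2 does not.
df = pd.DataFrame({"id": [1, 1, 2, 2], "cat": [1, 0, 0, 2]})

def has_one(x):
    # True if any value in the group equals 1.
    return (x == 1).any()

# Single brackets: SeriesGroupBy.transform
as_series = df.groupby("id")["cat"].transform(has_one)

# Double brackets: DataFrameGroupBy.transform
as_frame = df.groupby("id")[["cat"]].transform(has_one)

# Compare the resulting dtypes on your pandas version.
print(as_series.dtype)
print(as_frame["cat"].dtype)
```

Either way the values agree once interpreted as booleans; only the dtype of the result may differ between the two paths.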