pandas df.apply unexpectedly changes dataframe inplace
This may be late, but I think it can help anyone who lands on this question.
When we define foo like this:
def foo(row: pd.Series):
    row['b'] = '42'
and then use it in:
df.apply(foo, axis=1)
we don't expect any change in df, but one occurs. Why?
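Let's review what happens under the hood. apply calls foo and passes one row to it. In Python, arguments are never copied on a call: the function receives a reference to the same object the caller holds. For immutable objects (int, float, str, ...) this is harmless because they cannot be modified in place, but a pandas.Series is mutable, so the row inside foo is the very same object that apply created (equal in value and pointing to the same block of memory).
In this example both columns have the same dtype, so that row can be a view on the data stored in df rather than a copy. Any in-place change to row inside foo is then written straight into df. Whether the row is a view or a copy is an implementation detail that depends on the pandas version and the column dtypes, so this side effect should not be relied on.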
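The same mechanism can be seen without pandas. The sketch below uses a plain list of lists as a stand-in for the dataframe (set_first, rebind and table are hypothetical names, not part of the original code). It only illustrates how Python shares mutable objects across a call, not how pandas builds its rows internally.

```python
def set_first(values):
    # Mutates the object the caller passed in; no copy is made on the call.
    values[0] = 42


def rebind(number):
    # Rebinds the local name only; the caller's int is untouched.
    number = 42


table = [[1, 2], [3, 4]]
row = table[0]          # row is the same list object that table holds
set_first(row)

n = 1
rebind(n)

print(table)            # [[42, 2], [3, 4]]
print(row is table[0])  # True
print(n)                # 1
```

The inner list changes because row and table[0] are one object, while n stays 1 because assigning to a parameter never affects the caller's variable.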
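We can rewrite foo (I name the new version bar) so that it changes nothing in place, by deep-copying row, that is, by making another row with the same values in a different block of memory. A lambda passed to apply usually avoids the problem for a different reason: its body is a single expression, so it typically returns a new value instead of assigning into row.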
def bar(row: pd.Series):
    row_temp = row.copy(deep=True)
    row_temp['b'] = '42'
    return row_temp
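As a quick check on a single row, the sketch below calls bar on a hand-built Series (a hypothetical stand-in for one row of df) and confirms that the original row keeps its value while the returned copy carries the change.

```python
import pandas as pd


def bar(row: pd.Series):
    # Work on an independent copy so the caller's row is never touched.
    row_temp = row.copy(deep=True)
    row_temp['b'] = '42'
    return row_temp


row = pd.Series({'a': 'a0', 'b': 'b0'})  # hypothetical stand-in for one row
result = bar(row)

print(row['b'])     # b0
print(result['b'])  # 42
```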
Complete Code
import pandas as pd

# Changes df in place -- unlike a typical lambda
def foo(row: pd.Series):
    row['b'] = '42'

# Does not change df in place -- behaves like a typical lambda
def bar(row: pd.Series):
    row_temp = row.copy(deep=True)
    row_temp['b'] = '42'
    return row_temp

df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0', 'a1']
df2['b'] = ['b0', 'b1']
print(df2)

# No change in place
df_b = df2.apply(bar, axis=1)
print(df2)
# bar works: the result carries the new values
print(df_b)
print(df2)

# Changes df2 in place
df2.apply(foo, axis=1)
print(df2)
Output
# df2 before any change
    a   b
0  a0  b0
1  a1  b1
# df2.apply(bar, axis=1) did not change df2 in place
    a   b
0  a0  b0
1  a1  b1
# df_b = df2.apply(bar, axis=1): bar works as expected
    a   b
0  a0  42
1  a1  42
# df2 printed again to confirm it is unchanged
    a   b
0  a0  b0
1  a1  b1
# after df2.apply(foo, axis=1): foo changed df2 in place (compare with bar)
    a   b
0  a0  42
1  a1  42
Interesting question! I believe the behavior you're seeing is an artifact of the way you use apply.
As you correctly indicate, apply is not intended to be used to modify a dataframe. However, since apply takes an arbitrary function, it cannot guarantee that the function is free of side effects and leaves the dataframe unchanged. Here, you've found a great example of that behavior, because your function foo modifies the row that apply passes to it.
Using apply to modify a row can lead to these side effects, so it isn't best practice.
Instead, consider this idiomatic approach. apply is often used to create a new column. Here's an example of how apply is typically used, which I believe would steer you away from this potentially troublesome area:
import pandas as pd
# construct df2 just like you did
df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0','b0']
df2['b'] = ['a1','b1']
df2['b_copy'] = df2.apply(lambda row: row['b'], axis=1) # apply to each row
df2['b_replace'] = df2.apply(lambda row: '42', axis=1)
df2['b_reverse'] = df2['b'].apply(lambda val: val[::-1]) # apply to each value in b column
print(df2)
# output:
#     a   b  b_copy  b_replace  b_reverse
# 0  a0  a1      a1         42         1a
# 1  b0  b1      b1         42         1b
Notice that pandas passes a row or a cell to the function you give as the first argument to apply, and the function's output is then stored in the column you assign it to.
If you'd like to modify a dataframe row by row, take a look at iterrows and loc for the most idiomatic route.
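A minimal sketch of that route, assuming a small frame like the one above: each row from iterrows is only read, and every write goes through df.loc so the change is made explicitly on the dataframe. The upper-casing rule is just an illustrative transformation, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a0', 'a1'], 'b': ['b0', 'b1']})

for idx, row in df.iterrows():
    # Read from the row, write through df.loc using the row's index label.
    df.loc[idx, 'b'] = row['a'].upper()

print(df['b'].tolist())  # ['A0', 'A1']
```

When the new value does not depend on per-row logic, a plain column assignment such as df['b'] = '42' avoids the loop entirely.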