Pandas function: DataFrame.apply() runs top row twice
This is by design, as described here and here
The apply function needs to know the shape of the returned data to intelligently figure out how it will be combined. Apply is a shortcut that intelligently applies aggregate, transform or filter. You can try breaking apart your function like so to avoid the duplicate calls.
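To check how many times apply really calls a function on your own pandas version, a small counting wrapper can help. This is only a sketch: `make_counting` and `calls` are hypothetical names, not part of pandas. As far as I know, the pandas 1.1 release notes say that DataFrame.apply now evaluates the first row/column only once, so the duplicate call may only show up on older versions.

```python
import pandas as pd


def make_counting(func):
    """Wrap func so every call is recorded (hypothetical helper, not a pandas API)."""
    calls = []

    def wrapper(row):
        # With axis=1, row.name is the index label of the row being processed.
        calls.append(row.name)
        return func(row)

    return wrapper, calls


df = pd.DataFrame({"text": ["a", "b", "c"]})
counted, calls = make_counting(lambda row: row["text"].upper())
result = df.apply(counted, axis=1)

# If the first label appears twice here, your version probes the first row.
print(pd.__version__, "labels seen:", calls)
```

If label 0 is listed twice, any side effect inside the function (printing, logging, writing to a file) will also happen twice for that row.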
I honestly don't see any explanation of this in the provided links, but anyway: I ran into the same thing in my code and did the silliest thing, i.e. short-circuited the first call. It worked.
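is_first_call = True

def refill_uniform(row, st=600):
    # `global` because the flag is defined at module level here; use
    # `nonlocal` instead if this function is nested inside another function.
    global is_first_call
    if is_first_call:
        # Skip the extra first call and hand the row back untouched.
        # This assumes pandas really does call the function twice on the first row.
        is_first_call = False
        return row
    # ... here goes the code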
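If the real worry is a side effect running twice, one option that does not depend on how apply treats the first row is to loop over the values explicitly. This is a different technique from the flag above, shown only as a sketch; `process`, `tracked` and `calls` are hypothetical names.

```python
import pandas as pd


def process(text):
    # Stand-in for the real per-row work.
    return str(text) + " is processed"


calls = []


def tracked(text):
    # Record each call so the call count can be inspected afterwards.
    calls.append(text)
    return process(text)


df = pd.DataFrame({"text": ["text"] * 3})

# A plain list comprehension calls the function exactly once per value.
df["text"] = [tracked(t) for t in df["text"]]

print(len(calls))
print(df)
```

The trade-off is that this only covers the single-column case directly; for functions that need the whole row, something like `df.itertuples()` would be the equivalent loop.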
I faced the same issue today and spent a few hours searching Google for a solution. Finally I came up with a workaround like this:
import numpy as np
import pandas as pd
import time

def foo(text):
    text = str(text) + ' is processed'
    return text

def func1(data):
    # Returns a plain string built from the row; the row itself is not modified.
    print("run1")
    return foo(data['text'])

def func2(data):
    # Modifies the row in place and returns the whole row.
    print("run2")
    data['text'] = data['text'] + ' is processed'
    return data

def test_one():
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 3))
    data['text'] = 'text'
    start = time.time()
    data = data.apply(func1, axis=1)
    print(time.time() - start)
    print(data)

def test_two():
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 3))
    data['text'] = 'text'
    start = time.time()
    data = data.apply(func2, axis=1)
    print(time.time() - start)
    print(data)

test_one()
test_two()
If you run the program, you will see output like this:
run1
run1
run1
0.0029706954956054688
0 text is processed
1 text is processed
2 text is processed
dtype: object
run2
run2
run2
run2
0.0049877166748046875
text
0 text is processed is processed
1 text is processed
2 text is processed
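By splitting the function (func2) into func1 and foo, the first row is run only once. Note that func1 returns a plain string instead of modifying and returning the row, which is presumably what avoids the second call. It would also explain why row 0 in the second output reads "is processed is processed": func2 changed the row in place on both calls.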
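The doubled suffix on row 0 in the second output suggests that the row was modified in place on each call. If you do want to return the whole row, a minimal sketch of a variant that works on a copy is shown below, so an extra call cannot change the result. `add_suffix` is a hypothetical name, and this assumes the only problem with the duplicate call is the in-place modification.

```python
import pandas as pd


def add_suffix(row):
    # Copy first so the row object handed in by apply is never mutated;
    # calling this twice on the same row then gives the same result.
    row = row.copy()
    row["text"] = row["text"] + " is processed"
    return row


df = pd.DataFrame({"text": ["text"] * 3})
out = df.apply(add_suffix, axis=1)

print(out)
```

This does not stop an extra call from happening on versions that make one; it only makes the call harmless, so a function with external side effects would still need a different approach.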