Efficiently replace values from a column to another column Pandas DataFrame
Using np.where
is faster. Using a similar pattern as you used with replace
:
df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
However, using a nested np.where
is slightly faster:
df['col1'] = np.where(df['col1'] == 0,
np.where(df['col2'] == 0, df['col3'], df['col2']),
df['col1'])
Timings
Using the following setup to produce a larger sample DataFrame and timing functions:
df = pd.concat([df]*10**4, ignore_index=True)
def root_nested(df):
df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
return df
def root_split(df):
df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
return df
def pir2(df):
df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
return df
def pir2_2(df):
slc = (df.values != 0).argmax(axis=1)
return df.values[np.arange(slc.shape[0]), slc]
def andrew(df):
df.col1[df.col1 == 0] = df.col2
df.col1[df.col1 == 0] = df.col3
return df
def pablo(df):
df['col1'] = df['col1'].replace(0,df['col2'])
df['col1'] = df['col1'].replace(0,df['col3'])
return df
I get the following timings:
%timeit root_nested(df.copy())
100 loops, best of 3: 2.25 ms per loop
%timeit root_split(df.copy())
100 loops, best of 3: 2.62 ms per loop
%timeit pir2(df.copy())
100 loops, best of 3: 6.25 ms per loop
%timeit pir2_2(df.copy())
1 loop, best of 3: 2.4 ms per loop
%timeit andrew(df.copy())
100 loops, best of 3: 8.55 ms per loop
I tried timing your method, but it's been running for multiple minutes without completing. As a comparison, timing your method on just the 6 row example DataFrame (not the much larger one tested above) took 12.8 ms.
I'm not sure if it's faster, but you're right that you can slice the dataframe to get your desired result.
df.col1[df.col1 == 0] = df.col2
df.col1[df.col1 == 0] = df.col3
print(df)
Output:
col1 col2 col3
0 0.2 0.3 0.3
1 0.2 0.3 0.3
2 0.4 0.4 0.4
3 0.3 0.0 0.3
4 0.0 0.0 0.0
5 0.1 0.4 0.4
Alternatively if you want it to be more terse (though I don't know if it's faster) you can combine what you did with what I did.
df.col1[df.col1 == 0] = df.col2.replace(0, df.col3)
print(df)
Output:
col1 col2 col3
0 0.2 0.3 0.3
1 0.2 0.3 0.3
2 0.4 0.4 0.4
3 0.3 0.0 0.3
4 0.0 0.0 0.0
5 0.1 0.4 0.4
approach using pd.DataFrame.where
and pd.DataFrame.bfill
df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
df
Another approach using np.argmax
def pir2(df):
slc = (df.values != 0).argmax(axis=1)
return df.values[np.arange(slc.shape[0]), slc]
I know there is a better way to use numpy
to slice. I just can't think of it at the moment.