vectorize conditional assignment in pandas dataframe
One simple method is to assign the default value first and then make two loc calls:
In [66]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df
Out[66]:
   x
0  0
1 -3
2  5
3 -1
4  1
In [69]:
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
df
Out[69]:
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0
If you wanted to use np.where then you could do it with a nested np.where:
In [77]:
df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
df
Out[77]:
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0
Here the outer np.where tests the first condition: where x is less than -2 it returns 1. Everywhere else it falls through to the inner np.where, which returns -1 where x is greater than 2 and 0 otherwise.
Timings:
In [79]:
%timeit df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
1000 loops, best of 3: 1.79 ms per loop
In [81]:
%%timeit
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
100 loops, best of 3: 3.27 ms per loop
So for this sample dataset the np.where method is roughly twice as fast.
Use np.select for multiple conditions:
np.select(condlist, choicelist, default=0)
- Return elements in choicelist depending on the corresponding condition in condlist.
- The default element is used when all conditions evaluate to False.
condlist = [
df['x'] < -2,
df['x'] > 2,
]
choicelist = [
1,
-1,
]
df['y'] = np.select(condlist, choicelist, default=0)
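One detail worth knowing when conditions can overlap: np.select uses the first condition in condlist that is True for each element. A minimal sketch of that behaviour, reusing the sample frame from above (the second condition, x < 0, is made up purely for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [0, -3, 5, -1, 1]})

# Illustrative overlapping conditions: -3 satisfies both,
# so it takes the choice paired with the first one.
condlist = [
    df['x'] < -2,
    df['x'] < 0,
]
choicelist = [
    1,
    2,
]
df['y'] = np.select(condlist, choicelist, default=0)
print(df['y'].tolist())  # [0, 1, 0, 2, 0]
```

So the order of condlist matters in the same way the order of branches matters in an if/elif chain.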
np.select is much more readable than a nested np.where but just as fast:
df = pd.DataFrame({'x': np.random.randint(-5, 5, size=n)})
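A rough sketch for checking that the two approaches agree and for timing them on your own machine. The value of n, the seed, and the helper names with_where and with_select are assumptions made for this example, not part of the original benchmark:

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # hypothetical size, chosen only for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.integers(-5, 5, size=n)})

def with_where(x):
    # Nested np.where: first test x < -2, then x > 2, else 0.
    return np.where(x < -2, 1, np.where(x > 2, -1, 0))

def with_select(x):
    # The same logic expressed as parallel condition/choice lists.
    return np.select([x < -2, x > 2], [1, -1], default=0)

# Both functions should produce identical arrays.
same = np.array_equal(with_where(df['x']), with_select(df['x']))
print('identical results:', same)

# Average seconds per call over a few repetitions.
for fn in (with_where, with_select):
    seconds = timeit.timeit(lambda: fn(df['x']), number=5) / 5
    print(fn.__name__, seconds)
```

Actual timings depend on the data size, dtype, and library versions, so treat any single measurement as indicative only.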
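This is a good use case for pd.cut, where you define ranges and assign labels based on those ranges. Note that with right=False the bins are [-inf, -2), [-2, 2) and [2, inf), so a value of exactly 2 is labelled -1, unlike a strict x > 2 test. The resulting column is also categorical rather than integer: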
df['y'] = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False)
Output
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0
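The sample data contains neither -2 nor 2, so the bin edges are never exercised. A small sketch with illustrative edge values (not from the original question) shows how pd.cut with right=False compares with the strict conditions x < -2 and x > 2, and how to get plain integers back from the categorical result:

```python
import numpy as np
import pandas as pd

# Illustrative values that sit on and around the bin edges.
df = pd.DataFrame({'x': [-3, -2, 0, 2, 3]})

# right=False gives left-closed bins: [-inf, -2), [-2, 2), [2, inf).
cut = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False)

# Strict comparisons, as in the np.select version.
ref = np.select([df['x'] < -2, df['x'] > 2], [1, -1], default=0)

print(cut.astype(int).tolist())  # [1, 0, 0, -1, -1]
print(ref.tolist())              # [1, 0, 0, 0, -1]
```

The two differ only at x == 2. If the strict behaviour matters, adjust the bin edges to suit your data or use one of the condition-based approaches instead.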