How to delete rows from a pandas DataFrame based on a conditional expression
When you do len(df['column name']) you are just getting one number, namely the number of rows in the DataFrame (i.e., the length of the column itself). If you want to apply len to each element in the column, use df['column name'].map(len). So try:
df[df['column name'].map(len) < 2]
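To make the difference concrete, here is a minimal sketch. The sample strings are made up for illustration; 'column name' is just the placeholder used above.

```python
import pandas as pd

# Hypothetical sample data; 'column name' mirrors the placeholder above.
df = pd.DataFrame({'column name': ['a', 'bc', 'def', '']})

# len() on the column returns a single integer: the number of rows.
total_rows = len(df['column name'])

# map(len) applies len to each element and returns a Series of lengths.
lengths = df['column name'].map(len)

# Boolean indexing keeps only the rows whose string is shorter than 2 characters.
short = df[lengths < 2]
print(short)
```

Note that map(len) assumes every element has a length; a missing value (NaN) in the column would raise a TypeError.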
To directly answer this question's original title, "How to delete rows from a pandas DataFrame based on a conditional expression" (which I understand is not necessarily the OP's problem, but could help other users who come across this question), one way is to use the drop method:
df = df.drop(some labels)
df = df.drop(df[<some boolean condition>].index)
Example
To remove all rows where column 'score' is < 50:
df = df.drop(df[df.score < 50].index)
In place version (as pointed out in comments)
df.drop(df[df.score < 50].index, inplace=True)
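As a small end-to-end sketch of both variants (the 'name' column and the scores are invented for illustration):

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'name': ['ann', 'bob', 'cy', 'dee'],
                   'score': [35, 50, 72, 49]})

# Index labels of the rows to remove (here: labels 0 and 3).
to_drop = df[df.score < 50].index

# drop returns a new DataFrame and leaves df untouched...
kept = df.drop(to_drop)

# ...unless inplace=True is passed, in which case df itself is modified.
df.drop(to_drop, inplace=True)
print(df)
```

Keep in mind that drop works on index labels, so the surviving rows keep their original labels (1 and 2 here); call reset_index(drop=True) afterwards if you want them renumbered.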
Multiple conditions
(see Boolean Indexing)
The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.
To remove all rows where column 'score' is < 50 and > 20:
df = df.drop(df[(df.score < 50) & (df.score > 20)].index)
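A short sketch of combined conditions, using made-up scores. The parentheses are needed because & and | bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'score': [10, 20, 35, 50, 80]})

# AND: True only where both comparisons hold (only 35 here).
both = (df.score < 50) & (df.score > 20)
outside_range = df.drop(df[both].index)

# OR: True where at least one comparison holds (10 and 80 here).
either = (df.score < 20) | (df.score > 50)
middle = df.drop(df[either].index)

print(outside_range)
print(middle)
```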
You can assign the DataFrame to a filtered version of itself:
df = df[df.score > 50]
This is faster than drop (roughly 3.5x in the timings below, which also include the cost of creating the DataFrame):
%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test[test.x < 0]
# 54.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test.drop(test[test.x > 0].index, inplace=True)
# 201 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test.drop(test[test.x > 0].index)
# 194 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
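If you want to convince yourself that the two approaches give the same result before picking the faster one, here is a minimal check. It assumes the default unique index; the seed and the smaller size are arbitrary choices for the sketch.

```python
import numpy as np
import pandas as pd

# Smaller, seeded version of the benchmark frame (seed and size are arbitrary).
rng = np.random.default_rng(0)
test = pd.DataFrame({'x': rng.standard_normal(1000)})

# Condition describing the rows to delete.
mask = test.x > 0

# Approach 1: keep the rows where the condition is False.
filtered = test[~mask]

# Approach 2: drop the rows where the condition is True.
dropped = test.drop(test[mask].index)

# Both approaches keep the same rows with the same index labels.
print(filtered.equals(dropped))
```

Negating the mask with ~ keeps exactly the rows that drop would leave behind, so there is no need to rewrite the condition by hand (and risk turning < into > instead of >=).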