Filter out rows with more than certain number of NaN
The following should work
df.dropna(thresh=2)
See the online docs
What we are doing here is dropping any NaN
rows, where there are 2 or more non NaN
values in a row.
Example:
In [25]:
import pandas as pd
df = pd.DataFrame({'a':[1,2,NaN,4,5], 'b':[NaN,2,NaN,4,5], 'c':[1,2,NaN,NaN,NaN], 'd':[1,2,3,NaN,5]})
df
Out[25]:
a b c d
0 1 NaN 1 1
1 2 2 2 2
2 NaN NaN NaN 3
3 4 4 NaN NaN
4 5 5 NaN 5
[5 rows x 4 columns]
In [26]:
df.dropna(thresh=2)
Out[26]:
a b c d
0 1 NaN 1 1
1 2 2 2 2
3 4 4 NaN NaN
4 5 5 NaN 5
[4 rows x 4 columns]
EDIT
For the above example it works but you should note that you would have to know the number of columns and set the thresh
value appropriately, I thought originally it meant the number of NaN
values but it actually means number of Non NaN
values.
You have phrased 2 slightly different questions here. In the general case, they have different answers.
I would like to keep only those rows where at least 2 columns have finite values.
df = df.dropna(thresh=2)
This keeps rows with 2 or more non-null values.
I would like to filter out all the rows that have more than 2
NaNs
df = df.dropna(thresh=df.shape[1]-2)
This filters out rows with 2 or more null values.
In your example dataframe of 4 columns, these operations are equivalent, since df.shape[1] - 2 == 2
. However, you will notice discrepancies with dataframes which do not have exactly 4 columns.
Note dropna
also has a subset
argument should you wish to include only specified columns when applying a threshold. For example:
df = df.dropna(subset=['col1', 'col2', 'col3'], thresh=2)