Delete group if NaN is present anywhere in multiple columns
Using query
in pandas checking null and finding unique
county
county = data.query("Base_2011.isnull() or Base_2007.isnull()", engine='python').County.unique()
select all the row with remaining county from the list
data[~data.County.isin(county)]
State Year Base_2007 Base_2011 County
7 DM 2013 34.0 45.0 Dummy
8 DM 2012 34.0 45.0 Dummy
I tested this on the below dataset (this also requires the NA
to be replaced by np.nan
as df = df.replace('NA', np.nan)
if they are strings)
print(df)
State Year Base_2007 Base_2011 County
0 AL 2012 NaN 14.0 Alabama_Country
1 AL 2013 12.0 20.0 Alabama_Country
2 AL 2014 13.0 NaN Alabama_Country
3 DC 2011 NaN 20.0 Trenton
4 DC 2012 19.0 NaN Trenton
5 DC 2013 20.0 21.0 Trenton
6 DC 2014 25.0 30.0 Trenton
7 DM 2013 34.0 45.0 Dummy
8 DM 2012 34.0 45.0 Dummy
Removing County
containing NaN works by using :
df_new=df.loc[~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna().\
any(axis=1),'County'])]
print(df_new)
State Year Base_2007 Base_2011 County
7 DM 2013 34.0 45.0 Dummy
8 DM 2012 34.0 45.0 Dummy
I will update the explanation shortly.
Explanation
The following finds any NaN rows based on subset of
Base_2007
andBase_2011
df[['Base_2007','Base_2011']].isna().any(axis=1)
0 True
1 False
2 True
3 True
4 True
5 False
6 False
7 False
8 False
Taking the above output as a boolean mask , we call the df.loc[]
function as:
df.loc[df[['Base_2007','Base_2011']].isna().any(axis=1),'County']
which gives:
0 Alabama_Country
2 Alabama_Country
3 Trenton
4 Trenton
Note we are taking only the County
column under **df.loc[]**
. the reason for this is the next step.
We take the output above and find if any of cells in County
column from the original dataframe exists in the output we got above by using s.isin()
This returns True for rows in County
which is present in the output of df.loc[]
.
Then we negate them by an invert ~
which turns all True
to False
and vice versa.
~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna().any(axis=1),'County'])
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 True
Once we have this ready, we apply the same logic as df.loc[]
.
Finally we get the dataframe which returns only those County which doesnot have a NaN in Base_2007
and Base_2011
.
Note : if we want the index to start from 0 and not the slice of the dataframe, we can add a reset_index(drop=True)
to the end of the code as:
df_new=df.loc[~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna().\
any(axis=1),'County'])].reset_index(drop=True)
State Year Base_2007 Base_2011 County
0 DM 2013 34.0 45.0 Dummy
1 DM 2012 34.0 45.0 Dummy