Delete group if NaN is present anywhere in multiple columns

Using query in pandas checking null and finding unique county

county = data.query("Base_2011.isnull() or Base_2007.isnull()", engine='python').County.unique()

select all the row with remaining county from the list

data[~data.County.isin(county)]

State   Year    Base_2007   Base_2011   County
7   DM  2013    34.0    45.0    Dummy
8   DM  2012    34.0    45.0    Dummy

I tested this on the below dataset (this also requires the NA to be replaced by np.nan as df = df.replace('NA', np.nan) if they are strings)

print(df)

  State  Year  Base_2007  Base_2011           County
0    AL  2012        NaN       14.0  Alabama_Country
1    AL  2013       12.0       20.0  Alabama_Country
2    AL  2014       13.0        NaN  Alabama_Country
3    DC  2011        NaN       20.0          Trenton
4    DC  2012       19.0        NaN          Trenton
5    DC  2013       20.0       21.0          Trenton
6    DC  2014       25.0       30.0          Trenton
7    DM  2013       34.0       45.0            Dummy
8    DM  2012       34.0       45.0            Dummy

Removing County containing NaN works by using :

df_new=df.loc[~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna().\
                                        any(axis=1),'County'])]
print(df_new)

  State  Year  Base_2007  Base_2011 County
7    DM  2013       34.0       45.0  Dummy
8    DM  2012       34.0       45.0  Dummy

I will update the explanation shortly.

Explanation

The following finds any NaN rows based on subset of Base_2007 and Base_2011

df[['Base_2007','Base_2011']].isna().any(axis=1)
0     True
1    False
2     True
3     True
4     True
5    False
6    False
7    False
8    False

Taking the above output as a boolean mask , we call the df.loc[] function as:

df.loc[df[['Base_2007','Base_2011']].isna().any(axis=1),'County']

which gives:

0    Alabama_Country
2    Alabama_Country
3            Trenton
4            Trenton

Note we are taking only the County column under **df.loc[]**. the reason for this is the next step.

We take the output above and find if any of cells in County column from the original dataframe exists in the output we got above by using s.isin()

This returns True for rows in County which is present in the output of df.loc[].

Then we negate them by an invert ~ which turns all True to False and vice versa.

~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna().any(axis=1),'County'])
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7     True
8     True

Once we have this ready, we apply the same logic as df.loc[] .

Finally we get the dataframe which returns only those County which doesnot have a NaN in Base_2007 and Base_2011.

Note : if we want the index to start from 0 and not the slice of the dataframe, we can add a reset_index(drop=True) to the end of the code as:

df_new=df.loc[~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna().\                                 
                    any(axis=1),'County'])].reset_index(drop=True)

   State  Year  Base_2007  Base_2011 County
0    DM  2013       34.0       45.0  Dummy
1    DM  2012       34.0       45.0  Dummy

Delete group if NaN is present anywhere in multiple columns

Tags:

Python

Pandas

Dataframe

Related

Recent Posts