How to remove common rows in two dataframes in Pandas?
You can use the index.difference() function:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randn(5, 2), index= ['row' + str(i) for i in range(1, 6)])
df1
0 1
row1 0.249451 -0.107651
row2 1.295390 -1.773707
row3 -0.893647 -0.683306
row4 -1.090551 0.016833
row5 0.864612 0.369138
df2 = pd.DataFrame(np.random.randn(2, 2), index= ['row' + str(i) for i in [2, 5]])
df2
0 1
row2 0.549396 -0.675574
row5 1.348785 0.942216
df1.loc[df1.index.difference(df2.index)]
0 1
row1 0.249451 -0.107651
row3 -0.893647 -0.683306
row4 -1.090551 0.016833
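An equivalent sketch (my variation, not part of the original answer) uses a negated Index.isin mask instead; unlike index.difference, this also behaves predictably when df1 has duplicate index labels:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(5, 2), index=['row' + str(i) for i in range(1, 6)])
df2 = pd.DataFrame(np.random.randn(2, 2), index=['row' + str(i) for i in [2, 5]])

# Boolean mask: keep the rows of df1 whose index label does not appear in df2.
# Row order of df1 is preserved, and the index does not have to be unique.
result = df1[~df1.index.isin(df2.index)]
print(result.index.tolist())  # ['row1', 'row3', 'row4']
```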
You can use pandas.concat
to concatenate the two dataframes row-wise, followed by drop_duplicates(keep=False)
to remove every row that appears in both of them.
In [1]: import pandas as pd
df_1 = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df_2 = pd.DataFrame({"A":["foo", "bar", "foo", "bar"], "B":[1,0,1,0], "C":["A","B","A","B"]})
In [2]: df = pd.concat([df_1, df_2])
In [3]: df
Out[3]:
A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A
0 foo 1 A
1 bar 0 B
2 foo 1 A
3 bar 0 B
In [4]: df.drop_duplicates(keep=False)
Out[4]:
A B C
0 foo 0 A
2 foo 1 B
3 bar 1 A
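A related trick (my addition, not part of the original answer) gets only the rows unique to df_1, rather than the rows unique to either frame: concatenate df_2 twice, so all of its rows are guaranteed to be duplicated and therefore dropped by keep=False. A sketch using the same frames as above:

```python
import pandas as pd

df_1 = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})
df_2 = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "B": [1, 0, 1, 0], "C": ["A", "B", "A", "B"]})

# Appending df_2 twice ensures each of its rows occurs at least twice in the
# concatenation, so keep=False removes them all; only rows that appear exactly
# once in df_1 (and never in df_2) survive.
only_in_df_1 = pd.concat([df_1, df_2, df_2]).drop_duplicates(keep=False)
print(only_in_df_1)
```

Note that this still removes rows that are legitimately duplicated within df_1 itself, which is the caveat discussed further below.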
This is the best way to do it:
df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(),
how='left', indicator=True)
df.loc[df._merge=='left_only',df.columns!='_merge']
Note that drop_duplicates is only used to minimize the number of comparisons; the code works without it as well.
Why is this the best way?
Because it compares the row contents themselves, not the index or one or two columns, and the same code can be reused with the other filters ('both' and 'right_only') to achieve similar results.
- index.difference only works for unique index based comparisons
- pandas.concat() coupled with drop_duplicates() is not ideal, because it will also remove rows that may exist only in the dataframe you want to keep and are duplicated for valid reasons.
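Putting the merge/indicator recipe together as a runnable sketch (reusing the df_1/df_2 frames from the concat answer above):

```python
import pandas as pd

df_1 = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})
df_2 = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "B": [1, 0, 1, 0], "C": ["A", "B", "A", "B"]})

# Left merge on every column; indicator=True adds a '_merge' column that marks
# each row as 'left_only', 'right_only', or 'both'.
merged = df_1.drop_duplicates().merge(
    df_2.drop_duplicates(), on=df_2.columns.to_list(), how='left', indicator=True
)

# Keep the rows found only in df_1, and drop the helper '_merge' column.
unique_rows = merged.loc[merged._merge == 'left_only', merged.columns != '_merge']
print(unique_rows)
```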