How to remove common rows in two dataframes in Pandas?
You can use the index.difference() function:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randn(5, 2), index= ['row' + str(i) for i in range(1, 6)])
df1
0 1
row1 0.249451 -0.107651
row2 1.295390 -1.773707
row3 -0.893647 -0.683306
row4 -1.090551 0.016833
row5 0.864612 0.369138
df2 = pd.DataFrame(np.random.randn(2, 2), index= ['row' + str(i) for i in [2, 5]])
df2
0 1
row2 0.549396 -0.675574
row5 1.348785 0.942216
df1.loc[df1.index.difference(df2.index)]
0 1
row1 0.249451 -0.107651
row3 -0.893647 -0.683306
row4 -1.090551 0.016833
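An equivalent sketch (my variation, not part of the original answer) uses a negated Index.isin mask instead; unlike index.difference, this also behaves predictably when df1 has duplicate index labels:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(5, 2), index=['row' + str(i) for i in range(1, 6)])
df2 = pd.DataFrame(np.random.randn(2, 2), index=['row' + str(i) for i in [2, 5]])

# Boolean mask: keep the rows of df1 whose index label does not appear in df2.
# Row order of df1 is preserved, and the index does not have to be unique.
result = df1[~df1.index.isin(df2.index)]
print(result.index.tolist())  # ['row1', 'row3', 'row4']
```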
You can use pandas.concat
to concatenate the two dataframes row-wise, followed by drop_duplicates(keep=False)
to remove every row that appears in both of them.
In [1]: import pandas as pd
df_1 = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df_2 = pd.DataFrame({"A":["foo", "bar", "foo", "bar"], "B":[1,0,1,0], "C":["A","B","A","B"]})
In [2]: df = pd.concat([df_1, df_2])
In [3]: df
Out[3]:
A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A
0 foo 1 A
1 bar 0 B
2 foo 1 A
3 bar 0 B
In [4]: df.drop_duplicates(keep=False)
Out[4]:
A B C
0 foo 0 A
2 foo 1 B
3 bar 1 A
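A related trick (my addition, not part of the original answer) gets only the rows unique to df_1, rather than the rows unique to either frame: concatenate df_2 twice, so all of its rows are guaranteed to be duplicated and therefore dropped by keep=False. A sketch using the same frames as above:

```python
import pandas as pd

df_1 = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})
df_2 = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "B": [1, 0, 1, 0], "C": ["A", "B", "A", "B"]})

# Appending df_2 twice ensures each of its rows occurs at least twice in the
# concatenation, so keep=False removes them all; only rows that appear exactly
# once in df_1 (and never in df_2) survive.
only_in_df_1 = pd.concat([df_1, df_2, df_2]).drop_duplicates(keep=False)
print(only_in_df_1)
```

Note that this still removes rows that are legitimately duplicated within df_1 itself, which is the caveat discussed further below.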
This is the best way to do it:
df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(),
how='left', indicator=True)
df.loc[df._merge=='left_only',df.columns!='_merge']
Note that drop_duplicates is only used to minimize the number of comparisons; the code works without it as well.
Why is this the best way?
Because it compares the row contents themselves, not the index or one or two columns, and the same code can be reused with the other filters ('both' and 'right_only') to achieve similar results.
- index.difference only works for unique index based comparisons
- pandas.concat() coupled with drop_duplicates() is not ideal, because it will also remove rows that may exist only in the dataframe you want to keep and are duplicated for valid reasons.
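Putting the merge/indicator recipe together as a runnable sketch (reusing the df_1/df_2 frames from the concat answer above):

```python
import pandas as pd

df_1 = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})
df_2 = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "B": [1, 0, 1, 0], "C": ["A", "B", "A", "B"]})

# Left merge on every column; indicator=True adds a '_merge' column that marks
# each row as 'left_only', 'right_only', or 'both'.
merged = df_1.drop_duplicates().merge(
    df_2.drop_duplicates(), on=df_2.columns.to_list(), how='left', indicator=True
)

# Keep the rows found only in df_1, and drop the helper '_merge' column.
unique_rows = merged.loc[merged._merge == 'left_only', merged.columns != '_merge']
print(unique_rows)
```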