How to count duplicate rows in a pandas DataFrame?
You can groupby on all the columns and call size; the index indicates the duplicate values:
In [28]:
df.groupby(df.columns.tolist()).size()

Out[28]:
one    three  two
False  False  True     1
True   False  False    2
       True   True     1
dtype: int64
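For context, here is a minimal, hypothetical frame that reproduces the output above (the question's actual data isn't shown, so the values are assumptions):

import pandas as pd

# Hypothetical data, chosen only so the groupby/size output above falls out
df = pd.DataFrame({"one":   [True, False, True, True],
                   "three": [False, False, False, True],
                   "two":   [False, True, False, True]})

print(df.groupby(df.columns.tolist()).size())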
If you want the counts in a regular column rather than in the index, reset the index and rename the count column:

df.groupby(df.columns.tolist()).size().reset_index().\
    rename(columns={0:'records'})

   one  two  records
0    1    1        2
1    1    2        1
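A runnable sketch of the same idea, again with assumed data; passing name= to reset_index labels the count column directly and skips the rename step:

import pandas as pd

# Hypothetical data matching the output above
df = pd.DataFrame({"one": [1, 1, 1], "two": [1, 1, 2]})

# name= labels the count column in one step instead of renaming column 0
counts = df.groupby(df.columns.tolist()).size().reset_index(name="records")
print(counts)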
Specific to your question, as the others mentioned, a fast and easy way would be:
df.groupby(df.columns.tolist(),as_index=False).size()
If you want to count duplicates in particular column(s):
len(df['one'])-len(df['one'].drop_duplicates())
If you want to count duplicates in the entire dataframe:
len(df)-len(df.drop_duplicates())
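A quick sketch of both counts on made-up data:

import pandas as pd

# Hypothetical frame: the last row duplicates the first
df = pd.DataFrame({"one": [1, 2, 1], "two": ["a", "b", "a"]})

# Duplicates within a single column
print(len(df["one"]) - len(df["one"].drop_duplicates()))  # 1

# Duplicates across the whole frame
print(len(df) - len(df.drop_duplicates()))  # 1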
Or you can simply use DataFrame.duplicated(subset=None, keep='first'):
df.duplicated(subset='one', keep='first').sum()
where
subset : column label or sequence of labels (by default, use all of the columns)
keep : {‘first’, ‘last’, False}, default ‘first’
- first : Mark duplicates as True except for the first occurrence.
- last : Mark duplicates as True except for the last occurrence.
- False : Mark all duplicates as True.
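A small sketch of how the keep options differ, using assumed data:

import pandas as pd

# Hypothetical frame where the value 1 appears twice in 'one'
df = pd.DataFrame({"one": [1, 2, 1], "two": [10, 20, 30]})

print(df.duplicated(subset="one", keep="first").sum())  # 1: only the second 1 is flagged
print(df.duplicated(subset="one", keep="last").sum())   # 1: only the first 1 is flagged
print(df.duplicated(subset="one", keep=False).sum())    # 2: both 1s are flagged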
I use:
used_features = [
"one",
"two",
"three"
]
df['is_duplicated'] = df.duplicated(used_features)
df['is_duplicated'].sum()
which gives the count of duplicated rows, and you can then analyse them via the new column. I didn't see such a solution here.
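For instance, a self-contained sketch with made-up data, flagging the duplicates and then pulling them out for inspection:

import pandas as pd

# Hypothetical data: the last row repeats the first
df = pd.DataFrame({"one": [1, 2, 1], "two": [1, 2, 1], "three": [0, 0, 0]})

used_features = ["one", "two", "three"]
df['is_duplicated'] = df.duplicated(used_features)
print(df['is_duplicated'].sum())  # 1
print(df[df['is_duplicated']])    # inspect the flagged rows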