How to find duplicate names using pandas?
Most of the responses given demonstrate how to remove the duplicates, not find them.
The following selects every row in the data frame whose 'name' value is
duplicated. With keep=False it marks every instance, not just the duplicates after the first occurrence. The keep
argument also accepts 'first' (the default), which leaves the first occurrence unmarked, and 'last', which leaves the last occurrence unmarked.
df[df.duplicated(['name'], keep=False)]
See the pandas reference for DataFrame.duplicated() for details.
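To make the effect of keep concrete, here is a minimal sketch comparing the three settings. The sample data and the variable names (all_dupes, after_first, before_last) are made up for illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical sample data, chosen only to illustrate the keep argument.
df = pd.DataFrame({
    "name": ["willy", "willy", "zoe", "willy"],
    "age": [10, 11, 10, 12],
})

# keep=False marks every row whose name appears more than once.
all_dupes = df[df.duplicated(["name"], keep=False)]

# keep="first" (the default) leaves the first occurrence unmarked.
after_first = df[df.duplicated(["name"], keep="first")]

# keep="last" leaves the last occurrence unmarked.
before_last = df[df.duplicated(["name"], keep="last")]

print(all_dupes)
print(after_first)
print(before_last)
```

With this data, keep=False returns rows 0, 1 and 3, keep="first" returns rows 1 and 3, and keep="last" returns rows 0 and 1.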
value_counts() will also tell you how many times each duplicated name occurs.
names = df.name.value_counts()
names[names > 1]
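As a rough sketch of how the counts can be used to get back to the rows themselves, the filtered index can be passed to isin(). The sample data and the names dupe_counts and dupe_rows are assumptions for illustration; bracket access (df["name"]) is used here instead of df.name only because it also works for column names that clash with DataFrame attributes.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"name": ["willy", "willy", "zoe", "ann", "ann", "ann"]})

# Count how often each name occurs, then keep only names seen more than once.
names = df["name"].value_counts()
dupe_counts = names[names > 1]

# Use the duplicated names to select the matching rows.
dupe_rows = df[df["name"].isin(dupe_counts.index)]

print(dupe_counts)
print(dupe_rows)
```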
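A one-liner can be:
x.set_index('name').index.get_duplicates()
The index has a method for finding duplicates; columns do not seem to have a similar method. Note that Index.get_duplicates() was deprecated in pandas 0.23 and removed in pandas 1.0, so this only works on older versions; the suggested replacement is idx[idx.duplicated()].unique().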
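For pandas versions where Index.get_duplicates() is unavailable, one possible equivalent is sketched here. The data frame x and its contents are hypothetical. The second variant relies on Series.duplicated(), which works on the column directly without building an index.

```python
import pandas as pd

# Hypothetical sample data for illustration.
x = pd.DataFrame({
    "name": ["willy", "willy", "zoe", "willy"],
    "age": [10, 11, 10, 12],
})

# Index-based variant: keep the entries flagged as duplicates, then de-duplicate them.
idx = pd.Index(x["name"])
dupes = idx[idx.duplicated()].unique()

# Column-based variant: Series.duplicated() gives the same boolean mask.
dupes_from_column = x.loc[x["name"].duplicated(), "name"].unique()

print(list(dupes))
print(list(dupes_from_column))
```

Both variants return each duplicated name once, regardless of how many times it repeats.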
If you want to find the rows with a duplicated name (excluding the first occurrence of each name), you can try this:
In [16]: import pandas as pd
In [17]: p1 = {'name': 'willy', 'age': 10}
In [18]: p2 = {'name': 'willy', 'age': 11}
In [19]: p3 = {'name': 'zoe', 'age': 10}
In [20]: df = pd.DataFrame([p1, p2, p3])
In [21]: df
Out[21]:
   age   name
0   10  willy
1   11  willy
2   10    zoe
In [22]: df.duplicated('name')
Out[22]:
0    False
1     True
2    False