Count unique values with pandas per groups
You need nunique
:
df = df.groupby('domain')['ID'].nunique()
print (df)
domain
'facebook.com' 1
'google.com' 1
'twitter.com' 2
'vk.com' 3
Name: ID, dtype: int64
If you need to strip
'
characters:
df = df.ID.groupby([df.domain.str.strip("'")]).nunique()
print (df)
domain
facebook.com 1
google.com 1
twitter.com 2
vk.com 3
Name: ID, dtype: int64
Or as Jon Clements commented:
df.groupby(df.domain.str.strip("'"))['ID'].nunique()
You can retain the column name like this:
df = df.groupby(by='domain', as_index=False).agg({'ID': pd.Series.nunique})
print(df)
domain ID
0 fb 1
1 ggl 1
2 twitter 2
3 vk 3
The difference is that nunique()
returns a Series and agg()
returns a DataFrame.
Generally to count distinct values in single column, you can use Series.value_counts
:
df.domain.value_counts()
#'vk.com' 5
#'twitter.com' 2
#'facebook.com' 1
#'google.com' 1
#Name: domain, dtype: int64
To see how many unique values in a column, use Series.nunique
:
df.domain.nunique()
# 4
To get all these distinct values, you can use unique
or drop_duplicates
, the slight difference between the two functions is that unique
return a numpy.array
while drop_duplicates
returns a pandas.Series
:
df.domain.unique()
# array(["'vk.com'", "'twitter.com'", "'facebook.com'", "'google.com'"], dtype=object)
df.domain.drop_duplicates()
#0 'vk.com'
#2 'twitter.com'
#4 'facebook.com'
#6 'google.com'
#Name: domain, dtype: object
As for this specific problem, since you'd like to count distinct value with respect to another variable, besides groupby
method provided by other answers here, you can also simply drop duplicates firstly and then do value_counts()
:
import pandas as pd
df.drop_duplicates().domain.value_counts()
# 'vk.com' 3
# 'twitter.com' 2
# 'facebook.com' 1
# 'google.com' 1
# Name: domain, dtype: int64
df.domain.value_counts()
>>> df.domain.value_counts()
vk.com 5
twitter.com 2
google.com 1
facebook.com 1
Name: domain, dtype: int64