Sorting an entire CSV by frequency of occurrence in one column
Update 2021
The answers proposed by EdChum and Ilya K. do not work anymore.
The function pd.Series.value_counts returns a Series with the counts of unique values. But each Series we apply pd.Series.value_counts to contains only one unique value, because we already applied groupby to the DataFrame and split the CompanyName Series into groups of identical values. The final output after applying the function would therefore look something like this:
Customer3 4
dtype: int64
This is not usable: we cannot place a whole Series into a single value of another Series. We only need the integer 4, not the whole Series.
However, we can still take advantage of the earlier groupby by counting the number of values in each group, transforming each group into its size, and assembling the results into a final Frequency Series.
We can replace pd.Series.value_counts with pd.Series.count, or simply pass the function name 'count' as a string:
import pandas as pd
df = pd.DataFrame({'CompanyName': {0: 'Customer1', 1: 'Customer1', 2: 'Customer2', 3: 'Customer3', 4: 'Customer1', 5: 'Customer3', 6: 'Customer3', 7: 'Customer3', 8: 'Customer4'}, 'HighPriority': {0: 'Yes', 1: 'Yes', 2: 'No', 3: 'No', 4: 'No', 5: 'No', 6: 'Yes', 7: 'Yes', 8: 'No'}, 'QualityIssue': {0: 'User', 1: 'User', 2: 'User', 3: 'Equipment', 4: 'Neither', 5: 'User', 6: 'User', 7: 'Equipment', 8: 'User'}})
df['Frequency'] = df.groupby('CompanyName')['CompanyName'].transform('count')
df.sort_values('Frequency', inplace=True, ascending=False)
Output
>>> df
CompanyName HighPriority QualityIssue Frequency
3 Customer3 No Equipment 4
5 Customer3 No User 4
6 Customer3 Yes User 4
7 Customer3 Yes Equipment 4
0 Customer1 Yes User 3
1 Customer1 Yes User 3
4 Customer1 No Neither 3
2 Customer2 No User 1
8 Customer4 No User 1
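As a side note (my addition, not part of the answers above): on pandas 1.1 or later, the same ordering can be obtained without a helper column by using the key argument of sort_values, which maps each company name to its count just for the sort:

```python
import pandas as pd

df = pd.DataFrame({
    'CompanyName': ['Customer1', 'Customer1', 'Customer2', 'Customer3',
                    'Customer1', 'Customer3', 'Customer3', 'Customer3',
                    'Customer4'],
    'HighPriority': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No'],
    'QualityIssue': ['User', 'User', 'User', 'Equipment', 'Neither',
                     'User', 'User', 'Equipment', 'User'],
})

# key receives the 'CompanyName' Series; mapping each name to its
# value_counts() count sorts the rows by frequency without ever
# materializing a 'Frequency' column
result = df.sort_values(
    'CompanyName',
    key=lambda s: s.map(s.value_counts()),
    ascending=False,
)
print(result)
```

The frequency column never appears in the output, so there is nothing to drop afterwards.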
I think there must be a better way to do it, but this should work:
Preparing the data:
import io
import pandas as pd
data = """
CompanyName HighPriority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
"""
df = pd.read_table(io.StringIO(data), sep=r"\s+")
And doing the transformation:
# create a (sorted) data frame that lists the customers with their number of occurrences
count_df = pd.DataFrame(df.CompanyName.value_counts())
# join the count data frame back with the original data frame
new_index = count_df.merge(df[["CompanyName"]], left_index=True, right_on="CompanyName")
# output the original data frame in the order of the new index.
df.reindex(new_index.index)
The output:
CompanyName HighPriority QualityIssue
3 Customer3 No Equipment
5 Customer3 No User
6 Customer3 Yes User
7 Customer3 Yes Equipment
0 Customer1 Yes User
1 Customer1 Yes User
4 Customer1 No Neither
8 Customer4 No User
2 Customer2 No User
It's probably not intuitive what happens here, but at the moment I cannot think of a better way to do it. I tried to comment as much as possible.
The tricky part here is that the index of count_df is the set of unique customer names. Therefore, I join the index of count_df (left_index=True) with the CompanyName column of df (right_on="CompanyName").
The magic here is that count_df is already sorted by the number of occurrences, so no explicit sorting is needed. All we have to do is reorder the rows of the original data frame by the index of the joined data frame, and we get the expected result.
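To make the join mechanics concrete, here is a minimal sketch (same data as above) that prints the two intermediates; the index of new_index is the original row labels of df, already in count order:

```python
import io
import pandas as pd

data = """
CompanyName HighPriority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
"""
df = pd.read_table(io.StringIO(data), sep=r"\s+")

# index = unique customer names, rows sorted by descending count
count_df = pd.DataFrame(df.CompanyName.value_counts())
print(count_df)

# the inner join preserves the order of the left (count-sorted) keys,
# and the result carries df's index labels, so new_index.index is the
# original row labels in frequency order
new_index = count_df.merge(df[["CompanyName"]],
                           left_index=True, right_on="CompanyName")
print(new_index.index.tolist())
```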
The top-voted answer needs a minor addition: sort was deprecated in favour of sort_values and sort_index.
sort_values will work like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
df['count'] = \
df.groupby('a')['a']\
.transform(pd.Series.value_counts)
df.sort_values('count', inplace=True, ascending=False)
print('df sorted: \n{}'.format(df))
df sorted: 
   a  b  count
0  1  1      2
2  1  3      2
1  2  2      1
This seems to do what you want: basically, add a count column by performing a groupby and transform with value_counts, and then you can sort on that column:
df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort_values('count', ascending=False)
Output:
CompanyName HighPriority QualityIssue count
5 Customer3 No User 4
3 Customer3 No Equipment 4
7 Customer3 Yes Equipment 4
6 Customer3 Yes User 4
0 Customer1 Yes User 3
4 Customer1 No Neither 3
1 Customer1 Yes User 3
8 Customer4 No User 1
2 Customer2 No User 1
You can drop the extraneous column using df.drop (note that drop returns a new DataFrame unless you pass inplace=True):
df.drop('count', axis=1)
Output:
CompanyName HighPriority QualityIssue
5 Customer3 No User
3 Customer3 No Equipment
7 Customer3 Yes Equipment
6 Customer3 Yes User
0 Customer1 Yes User
4 Customer1 No Neither
1 Customer1 Yes User
8 Customer4 No User
2 Customer2 No User
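Since each of these steps returns a new DataFrame, the add-sort-drop sequence can also be written as one chain. A sketch (using the string 'count' for transform, which still works on current pandas, unlike pd.Series.value_counts):

```python
import pandas as pd

df = pd.DataFrame({
    'CompanyName': ['Customer1', 'Customer1', 'Customer2', 'Customer3',
                    'Customer1', 'Customer3', 'Customer3', 'Customer3',
                    'Customer4'],
    'HighPriority': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No'],
    'QualityIssue': ['User', 'User', 'User', 'Equipment', 'Neither',
                     'User', 'User', 'Equipment', 'User'],
})

# add the count column, sort on it, then drop it again, all without
# mutating df; assign/sort_values/drop each return a fresh DataFrame
result = (
    df.assign(count=df.groupby('CompanyName')['CompanyName'].transform('count'))
      .sort_values('count', ascending=False)
      .drop(columns='count')
)
print(result)
```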