Aggregation in pandas
If you are coming from an R or SQL background, here are three examples that will teach you everything you need to do aggregation the way you are already familiar with:
Let us first create a Pandas dataframe
import pandas as pd
df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
'key2' : ['c','c','d','d','e'],
'value1' : [1,2,2,3,3],
'value2' : [9,8,7,6,5]})
df.head(5)
Here is how the table we created looks like:
key1 | key2 | value1 | value2 |
---|---|---|---|
a | c | 1 | 9 |
a | c | 2 | 8 |
a | d | 2 | 7 |
b | d | 3 | 6 |
a | e | 3 | 5 |
1. Aggregating With Row Reduction Similar to SQL Group By
1.1 If Pandas version >=0.25
Check your Pandas version by running print(pd.__version__)
. If your Pandas version is 0.25 or above then the following code will work:
df_agg = df.groupby(['key1','key2']).agg(mean_of_value_1=('value1', 'mean'),
sum_of_value_2=('value2', 'sum'),
count_of_value1=('value1','size')
).reset_index()
df_agg.head(5)
The resulting data table will look like this:
key1 | key2 | mean_of_value1 | sum_of_value2 | count_of_value1 |
---|---|---|---|---|
a | c | 1.5 | 17 | 2 |
a | d | 2.0 | 7 | 1 |
a | e | 3.0 | 5 | 1 |
b | d | 3.0 | 6 | 1 |
The SQL equivalent of this is:
SELECT
key1
,key2
,AVG(value1) AS mean_of_value_1
,SUM(value2) AS sum_of_value_2
,COUNT(*) AS count_of_value1
FROM
df
GROUP BY
key1
,key2
1.2 If Pandas version <0.25
If your Pandas version is older than 0.25 then running the above code will give you the following error:
TypeError: aggregate() missing 1 required positional argument: 'arg'
Now to do the aggregation for both value1
and value2
, you will run this code:
df_agg = df.groupby(['key1','key2'],as_index=False).agg({'value1':['mean','count'],'value2':'sum'})
df_agg.columns = ['_'.join(col).strip() for col in df_agg.columns.values]
df_agg.head(5)
The resulting table will look like this:
key1 | key2 | value1_mean | value1_count | value2_sum |
---|---|---|---|---|
a | c | 1.5 | 2 | 17 |
a | d | 2.0 | 1 | 7 |
a | e | 3.0 | 1 | 5 |
b | d | 3.0 | 1 | 6 |
Renaming the columns needs to be done separately using the below code:
df_agg.rename(columns={"value1_mean" : "mean_of_value1",
"value1_count" : "count_of_value1",
"value2_sum" : "sum_of_value2"
}, inplace=True)
2. Create a Column Without Reduction in Rows (EXCEL - SUMIF, COUNTIF
)
If you want to do a SUMIF, COUNTIF, etc., like how you would do in Excel where there is no reduction in rows, then you need to do this instead.
df['Total_of_value1_by_key1'] = df.groupby('key1')['value1'].transform('sum')
df.head(5)
The resulting data frame will look like this with the same number of rows as the original:
key1 | key2 | value1 | value2 | Total_of_value1_by_key1 |
---|---|---|---|---|
a | c | 1 | 9 | 8 |
a | c | 2 | 8 | 8 |
a | d | 2 | 7 | 8 |
b | d | 3 | 6 | 3 |
a | e | 3 | 5 | 8 |
3. Creating a RANK Column ROW_NUMBER() OVER (PARTITION BY ORDER BY)
Finally, there might be cases where you want to create a rank column which is the SQL equivalent of ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY value1 DESC, value2 ASC)
.
Here is how you do that.
df['RN'] = df.sort_values(['value1','value2'], ascending=[False,True]) \
.groupby(['key1']) \
.cumcount() + 1
df.head(5)
Note: we make the code multi-line by adding \
at the end of each line.
Here is how the resulting data frame looks like:
key1 | key2 | value1 | value2 | RN |
---|---|---|---|---|
a | c | 1 | 9 | 4 |
a | c | 2 | 8 | 3 |
a | d | 2 | 7 | 2 |
b | d | 3 | 6 | 1 |
a | e | 3 | 5 | 1 |
In all the examples above, the final data table will have a table structure and won't have the pivot structure that you might get in other syntaxes.
Other aggregating operators:
mean()
Compute mean of groups
sum()
Compute sum of group values
size()
Compute group sizes
count()
Compute count of group
std()
Standard deviation of groups
var()
Compute variance of groups
sem()
Standard error of the mean of groups
describe()
Generates descriptive statistics
first()
Compute first of group values
last()
Compute last of group values
nth()
Take nth value, or a subset if n is a list
min()
Compute min of group values
max()
Compute max of group values
Question 1
How can I perform aggregation with Pandas?
Expanded aggregation documentation.
Aggregating functions are the ones that reduce the dimension of the returned objects. It means output Series/DataFrame have less or same rows like original.
Some common aggregating functions are tabulated below:
Function Description mean() Compute mean of groups sum() Compute sum of group values size() Compute group sizes count() Compute count of group std() Standard deviation of groups var() Compute variance of groups sem() Standard error of the mean of groups describe() Generates descriptive statistics first() Compute first of group values last() Compute last of group values nth() Take nth value, or a subset if n is a list min() Compute min of group values max() Compute max of group values
np.random.seed(123)
df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
'B' : ['one', 'two', 'three','two', 'two', 'one'],
'C' : np.random.randint(5, size=6),
'D' : np.random.randint(5, size=6),
'E' : np.random.randint(5, size=6)})
print (df)
A B C D E
0 foo one 2 3 0
1 foo two 4 1 0
2 bar three 2 1 1
3 foo two 1 0 3
4 bar two 3 1 4
5 foo one 2 1 0
Aggregation by filtered columns and Cython implemented functions:
df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
An aggregate function is used for all columns without being specified in the groupby
function, here the A, B
columns:
df2 = df.groupby(['A', 'B'], as_index=False).sum()
print (df2)
A B C D E
0 bar three 2 1 1
1 bar two 3 1 4
2 foo one 4 4 0
3 foo two 5 1 3
You can also specify only some columns used for aggregation in a list after the groupby
function:
df3 = df.groupby(['A', 'B'], as_index=False)['C','D'].sum()
print (df3)
A B C D
0 bar three 2 1
1 bar two 3 1
2 foo one 4 4
3 foo two 5 1
Same results by using function DataFrameGroupBy.agg
:
df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
print (df2)
A B C D E
0 bar three 2 1 1
1 bar two 3 1 4
2 foo one 4 4 0
3 foo two 5 1 3
For multiple functions applied for one column use a list of tuple
s - names of new columns and aggregated functions:
df4 = (df.groupby(['A', 'B'])['C']
.agg([('average','mean'),('total','sum')])
.reset_index())
print (df4)
A B average total
0 bar three 2.0 2
1 bar two 3.0 3
2 foo one 2.0 4
3 foo two 2.5 5
If want to pass multiple functions is possible pass list
of tuple
s:
df5 = (df.groupby(['A', 'B'])
.agg([('average','mean'),('total','sum')]))
print (df5)
C D E
average total average total average total
A B
bar three 2.0 2 1.0 1 1.0 1
two 3.0 3 1.0 1 4.0 4
foo one 2.0 4 2.0 4 0.0 0
two 2.5 5 0.5 1 1.5 3
Then get MultiIndex
in columns:
print (df5.columns)
MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
And for converting to columns, flattening MultiIndex
use map
with join
:
df5.columns = df5.columns.map('_'.join)
df5 = df5.reset_index()
print (df5)
A B C_average C_total D_average D_total E_average E_total
0 bar three 2.0 2 1.0 1 1.0 1
1 bar two 3.0 3 1.0 1 4.0 4
2 foo one 2.0 4 2.0 4 0.0 0
3 foo two 2.5 5 0.5 1 1.5 3
Another solution is pass list of aggregate functions, then flatten MultiIndex
and for another columns names use str.replace
:
df5 = df.groupby(['A', 'B']).agg(['mean','sum'])
df5.columns = (df5.columns.map('_'.join)
.str.replace('sum','total')
.str.replace('mean','average'))
df5 = df5.reset_index()
print (df5)
A B C_average C_total D_average D_total E_average E_total
0 bar three 2.0 2 1.0 1 1.0 1
1 bar two 3.0 3 1.0 1 4.0 4
2 foo one 2.0 4 2.0 4 0.0 0
3 foo two 2.5 5 0.5 1 1.5 3
If want specified each column with aggregated function separately pass dictionary
:
df6 = (df.groupby(['A', 'B'], as_index=False)
.agg({'C':'sum','D':'mean'})
.rename(columns={'C':'C_total', 'D':'D_average'}))
print (df6)
A B C_total D_average
0 bar three 2 1.0
1 bar two 3 1.0
2 foo one 4 2.0
3 foo two 5 0.5
You can pass custom function too:
def func(x):
return x.iat[0] + x.iat[-1]
df7 = (df.groupby(['A', 'B'], as_index=False)
.agg({'C':'sum','D': func})
.rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
print (df7)
A B C_total D_sum_first_and_last
0 bar three 2 2
1 bar two 3 2
2 foo one 4 4
3 foo two 5 1
Question 2
No DataFrame after aggregation! What happened?
Aggregation by two or more columns:
df1 = df.groupby(['A', 'B'])['C'].sum()
print (df1)
A B
bar three 2
two 3
foo one 4
two 5
Name: C, dtype: int32
First check the Index
and type
of a Pandas object:
print (df1.index)
MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
names=['A', 'B'])
print (type(df1))
<class 'pandas.core.series.Series'>
There are two solutions for how to get MultiIndex Series
to columns:
- add parameter
as_index=False
df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
- use
Series.reset_index
:
df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
print (df1)
A B C
0 bar three 2
1 bar two 3
2 foo one 4
3 foo two 5
If group by one column:
df2 = df.groupby('A')['C'].sum()
print (df2)
A
bar 5
foo 9
Name: C, dtype: int32
... get Series
with Index
:
print (df2.index)
Index(['bar', 'foo'], dtype='object', name='A')
print (type(df2))
<class 'pandas.core.series.Series'>
And the solution is the same like in the MultiIndex Series
:
df2 = df.groupby('A', as_index=False)['C'].sum()
print (df2)
A C
0 bar 5
1 foo 9
df2 = df.groupby('A')['C'].sum().reset_index()
print (df2)
A C
0 bar 5
1 foo 9
Question 3
How can I aggregate mainly strings columns (to list
s, tuple
s, strings with separator
)?
df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'],
'D' : [1,2,3,2,3,1,2]})
print (df)
A B C D
0 a one three 1
1 c two one 2
2 b three two 3
3 b two two 2
4 a two three 3
5 c one two 1
6 b three one 2
Instead of an aggregation function, it is possible to pass list
, tuple
, set
for converting the column:
df1 = df.groupby('A')['B'].agg(list).reset_index()
print (df1)
A B
0 a [one, two]
1 b [three, two, three]
2 c [two, one]
An alternative is use GroupBy.apply
:
df1 = df.groupby('A')['B'].apply(list).reset_index()
print (df1)
A B
0 a [one, two]
1 b [three, two, three]
2 c [two, one]
For converting to strings with a separator, use .join
only if it is a string column:
df2 = df.groupby('A')['B'].agg(','.join).reset_index()
print (df2)
A B
0 a one,two
1 b three,two,three
2 c two,one
If it is a numeric column, use a lambda function with astype
for converting to string
s:
df3 = (df.groupby('A')['D']
.agg(lambda x: ','.join(x.astype(str)))
.reset_index())
print (df3)
A D
0 a 1,3
1 b 3,2,2
2 c 2,1
Another solution is converting to strings before groupby
:
df3 = (df.assign(D = df['D'].astype(str))
.groupby('A')['D']
.agg(','.join).reset_index())
print (df3)
A D
0 a 1,3
1 b 3,2,2
2 c 2,1
For converting all columns, don't pass a list of column(s) after groupby
.
There isn't any column D
, because automatic exclusion of 'nuisance' columns. It means all numeric columns are excluded.
df4 = df.groupby('A').agg(','.join).reset_index()
print (df4)
A B C
0 a one,two three,three
1 b three,two,three two,two,one
2 c two,one one,two
So it's necessary to convert all columns into strings, and then get all columns:
df5 = (df.groupby('A')
.agg(lambda x: ','.join(x.astype(str)))
.reset_index())
print (df5)
A B C D
0 a one,two three,three 1,3
1 b three,two,three two,two,one 3,2,2
2 c two,one one,two 2,1
Question 4
How can I aggregate counts?
df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'],
'D' : [np.nan,2,3,2,3,np.nan,2]})
print (df)
A B C D
0 a one three NaN
1 c two NaN 2.0
2 b three NaN 3.0
3 b two two 2.0
4 a two three 3.0
5 c one two NaN
6 b three one 2.0
Function GroupBy.size
for size
of each group:
df1 = df.groupby('A').size().reset_index(name='COUNT')
print (df1)
A COUNT
0 a 2
1 b 3
2 c 2
Function GroupBy.count
excludes missing values:
df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
print (df2)
A COUNT
0 a 2
1 b 2
2 c 1
This function should be used for multiple columns for counting non-missing values:
df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
print (df3)
A B_COUNT C_COUNT D_COUNT
0 a 2 2 1
1 b 3 2 3
2 c 2 1 1
A related function is Series.value_counts
. It returns the size of the object containing counts of unique values in descending order, so that the first element is the most frequently-occurring element. It excludes NaN
s values by default.
df4 = (df['A'].value_counts()
.rename_axis('A')
.reset_index(name='COUNT'))
print (df4)
A COUNT
0 b 3
1 a 2
2 c 2
If you want same output like using function groupby
+ size
, add Series.sort_index
:
df5 = (df['A'].value_counts()
.sort_index()
.rename_axis('A')
.reset_index(name='COUNT'))
print (df5)
A COUNT
0 a 2
1 b 3
2 c 2
Question 5
How can I create a new column filled by aggregated values?
Method GroupBy.transform
returns an object that is indexed the same (same size) as the one being grouped.
See the Pandas documentation for more information.
np.random.seed(123)
df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
'B' : ['one', 'two', 'three','two', 'two', 'one'],
'C' : np.random.randint(5, size=6),
'D' : np.random.randint(5, size=6)})
print (df)
A B C D
0 foo one 2 3
1 foo two 4 1
2 bar three 2 1
3 foo two 1 0
4 bar two 3 1
5 foo one 2 1
df['C1'] = df.groupby('A')['C'].transform('sum')
df['C2'] = df.groupby(['A','B'])['C'].transform('sum')
df[['C3','D3']] = df.groupby('A')['C','D'].transform('sum')
df[['C4','D4']] = df.groupby(['A','B'])['C','D'].transform('sum')
print (df)
A B C D C1 C2 C3 D3 C4 D4
0 foo one 2 3 9 4 9 5 4 4
1 foo two 4 1 9 5 9 5 5 1
2 bar three 2 1 5 2 5 2 2 1
3 foo two 1 0 9 5 9 5 5 1
4 bar two 3 1 5 3 5 2 3 1
5 foo one 2 1 9 4 9 5 4 4