Pandas: group by and Pivot table difference

It's more appropriate to use .pivot_table() instead of .groupby() when you need to show aggregates with both rows and column labels.

.pivot_table() makes it easy to create row and column labels at the same time and is preferable, even though you can get similar results using .groupby() with few extra steps.


Both pivot_table and groupby are used to aggregate your dataframe. The difference is only with regard to the shape of the result.

Using pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum) a table is created where a is on the row axis, b is on the column axis, and the values are the sum of c.

Example:

df = pd.DataFrame({"a": [1,2,3,1,2,3], "b":[1,1,1,2,2,2], "c":np.random.rand(6)})
pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum)

b         1         2
a                    
1  0.528470  0.484766
2  0.187277  0.144326
3  0.866832  0.650100

Using groupby, the dimensions given are placed into columns, and rows are created for each combination of those dimensions.

In this example, we create a series of the sum of values c, grouped by all unique combinations of a and b.

df.groupby(['a','b'])['c'].sum()

a  b
1  1    0.528470
   2    0.484766
2  1    0.187277
   2    0.144326
3  1    0.866832
   2    0.650100
Name: c, dtype: float64

A similar usage of groupby is if we omit the ['c']. In this case, it creates a dataframe (not a series) of the sums of all remaining columns grouped by unique values of a and b.

print df.groupby(["a","b"]).sum()
            c
a b          
1 1  0.528470
  2  0.484766
2 1  0.187277
  2  0.144326
3 1  0.866832
  2  0.650100

pivot_table = groupby + unstack and groupby = pivot_table + stack hold True.

In particular, if columns parameter of pivot_table() is not used, then groupby() and pivot_table() both produce the same result (if the same aggregator function is used).

# sample
df = pd.DataFrame({"a": [1,1,1,2,2,2], "b": [1,1,2,2,3,3], "c": [0,0.5,1,1,2,2]})

# example
gb = df.groupby(['a','b'])[['c']].sum()
pt = df.pivot_table(index=['a','b'], values=['c'], aggfunc='sum')

# equality test
gb.equals(pt) #True

In general, if we check the source code, pivot_table() internally calls __internal_pivot_table(). This function creates a single flat list out of index and columns and calls groupby() with this list as the grouper. Then after aggregation, calls unstack() on the list of columns.

If columns are never passed, there is nothing to unstack on, so groupby and pivot_table trivially produce the same output.

A demonstration of this function is:

gb = (
    df
    .groupby(['a','b'])[['c']].sum()
    .unstack(['b'])
)
pt = df.pivot_table(index=['a'], columns=['b'], values=['c'], aggfunc='sum')

gb.equals(pt) # True

As stack() is the inverse operation of unstack(), the following holds True as well:

(
    df
    .pivot_table(index=['a'], columns=['b'], values=['c'], aggfunc='sum')
    .stack(['b'])
    .equals(
        df.groupby(['a','b'])[['c']].sum()
    )
) # True

In conclusion, depending on the use case, one is more convenient than the other but they can both be used instead of the other and after correctly applying stack()/unstack(), both will result in the same output.

Tags:

Python

Pandas