Benefits of panda's multiindex?

Hierarchical indexing (also referred to as “multi-level” indexing) was introduced in the pandas 0.4 release.

This opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure (DataFrame), for example.

Imagine constructing a dataframe using MultiIndex like this:-

import pandas as pd
import numpy as np

np.arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]

df = pd.DataFrame(np.random.randn(6,2),index=pd.MultiIndex.from_tuples(list(zip(*np.arrays))),columns=['A','B'])

df  # This is the dataframe we have generated

          A         B
one 1 -0.732470 -0.313871
    2 -0.031109 -2.068794
    3  1.520652  0.471764
two 1 -0.101713 -1.204458
    2  0.958008 -0.455419
    3 -0.191702 -0.915983

This df is simply a data structure of two dimensions

df.ndim

2

But we can imagine it, looking at the output, as a 3 dimensional data structure.

one with 1 with data -0.732470 -0.313871.
one with 2 with data -0.031109 -2.068794.
one with 3 with data 1.520652 0.471764.

A.k.a.: "effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure"

This is not just a "pretty display". It has the benefit of easy retrieval of data since we now have a hierarchal index.

For example.

In [44]: df.ix["one"]
Out[44]: 
          A         B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3  1.520652  0.471764

will give us a new data frame only for the group of data belonging to "one".

And we can narrow down our data selection further by doing this:-

In [45]: df.ix["one"].ix[1]
Out[45]: 
A   -0.732470
B   -0.313871
Name: 1

And of course, if we want a specific value, here's an example:-

In [46]: df.ix["one"].ix[1]["A"]
Out[46]: -0.73247029752040727

So if we have even more indexes (besides the 2 indexes shown in the example above), we can essentially drill down and select the data set we are really interested in without a need for groupby.

We can even grab a cross-section (either rows or columns) from our dataframe...

By rows:-

In [47]: df.xs('one')
Out[47]: 
          A         B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3  1.520652  0.471764

By columns:-

In [48]: df.xs('B', axis=1)
Out[48]: 
one  1   -0.313871
     2   -2.068794
     3    0.471764
two  1   -1.204458
     2   -0.455419
     3   -0.915983
Name: B

Great post by @Calvin Cheng, but thought I'd take a stab at this as well.

When to use a MultiIndex:

When a single column’s value isn’t enough to uniquely identify a row.
When data is logically hierarchical - meaning that it has multiple dimensions or “levels.”

Why (your core question) - at least these are the biggest benefits IMO:

Easy manipulation via stack() and unstack()
Easy math when there are multiple column levels
Syntactic sugar for slicing/filtering

Example:

                                                       Dollars  Units
Date       Store   Category Subcategory UPC EAN
2018-07-10 Store 1 Alcohol  Liqour      80480280024    154.77      7
           Store 2 Alcohol  Liqour      80480280024     82.08      4
           Store 3 Alcohol  Liqour      80480280024    259.38      9
           Store 1 Alcohol  Liquor      80432400630    477.68     14
                                        674545000001   139.68      4
           Store 2 Alcohol  Liquor      80432400630    203.88      6
                                        674545000001   377.13     13
           Store 3 Alcohol  Liquor      80432400630    239.19      7
                                        674545000001   432.32     14
           Store 1 Beer     Ales        94922755711     65.17      7
                                        702770082018   174.44     14
                                        736920111112    50.70      5
           Store 2 Beer     Ales        94922755711    129.60     12
                                        702770082018   107.40     10
                                        736920111112    59.65      5
           Store 3 Beer     Ales        94922755711    154.00     14
                                        702770082018   137.40     10
                                        736920111112   107.88     12
           Store 1 Beer     Lagers      702770081011   156.24     12
           Store 2 Beer     Lagers      702770081011   137.06     11
           Store 3 Beer     Lagers      702770081011   119.52      8

1) If we want to easily compare sales across stores, we can use df.unstack('Store') to line everything up side-by-side:

                                             Dollars                   Units
Store                                        Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024   154.77   82.08  259.38       7       4       9
                    Liquor      80432400630   477.68  203.88  239.19      14       6       7
                                674545000001  139.68  377.13  432.32       4      13      14
           Beer     Ales        94922755711    65.17  129.60  154.00       7      12      14
                                702770082018  174.44  107.40  137.40      14      10      10
                                736920111112   50.70   59.65  107.88       5       5      12
                    Lagers      702770081011  156.24  137.06  119.52      12      11       8

2) We can also easily do math on multiple columns. For example, df['Dollars'] / df['Units'] will then divide each store's dollars by its units, for every store without multiple operations:

Store                                         Store 1  Store 2  Store 3
Date       Category Subcategory UPC EAN
2018-07-10 Alcohol  Liqour      80480280024     22.11    20.52    28.82
                    Liquor      80432400630     34.12    33.98    34.17
                                674545000001    34.92    29.01    30.88
           Beer     Ales        94922755711      9.31    10.80    11.00
                                702770082018    12.46    10.74    13.74
                                736920111112    10.14    11.93     8.99
                    Lagers      702770081011    13.02    12.46    14.94

3) If we then want to filter to just specific rows, instead of using the

df[(df[col1] == val1) and (df[col2] == val2) and (df[col3] == val3)]

format, we can instead .xs or .query (yes these work for regular dfs, but it's not very useful). The syntax would instead be:

df.xs((val1, val2, val3), level=(col1, col2, col3))

More examples can be found in this tutorial notebook I put together.

Benefits of panda's multiindex?

Tags:

Python

Pandas

Multi Index

Related

Recent Posts