Count if in multiple index Dataframe
You can do value_counts
newdf=df.groupby(['City','Card','Colateral'])['% Debt Paid'].\
value_counts(bins=[-0.1,0,0.25,0.5,0.75,1,1.0001,999],normalize=True)
TL;DR
group_cols = ['City', 'Card', 'Colateral']
debt_col = '% Debt Paid'
# (1) Bin the data that is in non-zero-width intervals
bins = pd.IntervalIndex.from_breaks((0, 0.25, 0.5, 0.75, 1, np.inf),
closed='right')
ser_pt1 = df.groupby(group_cols, sort=False)[debt_col]\
.value_counts(bins=bins, sort=False, normalize=True)
# (2) Get the data from zero width intervals (0% and 100%)
ser_pt2 = df[df[debt_col].isin((0, 1))]\
.groupby(group_cols)[debt_col].value_counts()
# Take also "zero counts" and normalize
ser_pt2 = ser_pt2.reindex(
pd.MultiIndex.from_product(ser_pt2.index.levels,
names=ser_pt2.index.names),
fill_value=0) / df.groupby(group_cols)[debt_col].count()
# (3) Combine the results
ser_out = pd.concat([ser_pt1, ser_pt2])
Here's the quick-n-dirty answer. Below is a copy-pasteable full answer which also makes the index names and ordering as requested in the question.
1. Summary
The problem comes more difficult to solve since the bins you want are intersecting. That is, you want to have bin for ]75, 100]
and [100, 100]
, which both should include the case where % Debt Paid
is 1.0
. I would handle two cases separately
(1) Binning for values ]0, 25]%
, ]25, 50]%
, ... ,]100%, np.inf]%
(2) 0%
and 100%
2. Description of solution
2.1 Binned part
- The binned part is calculated using
gp[debt_col].value_counts
, which is essentially using pd.Series.value_counts sincegp
is a DataFrameGroupBy object andgp[debt_col]
is a SeriesGroupBy object. - The bins needed for the value_counts can be created easily from a list of endpoints using pd.IntervalIndex.from_breaks
- The
>100%
is also a bin, with right endpoint at infinity (np.inf
).
2.2 The rest (0% and 100%)
- Use the pd.Series.isin at
df[debt_col].isin((0, 1))
to select the0.0
and1.0
cases only, and then use value_counts to count the occurences of "0%" and "100%". - Then, we also need to include the cases where the count is zero. This can be done by reindexing. So, we use pd.Series.reindex to give a row for each ("City", "Card", "Colateral") combination, and form there combinations with pd.MultiIndex.from_product
- Lastly, we normalize the counts by dividing with the total counts in each group (
df.groupby(group_cols)[debt_col].count()
)
2.3 Renaming
- Our new index (level 3, called 'bin') is now ready, but to get the to same output as in the OP's question, we need to rename the index labels. This is done just looping over the values and using a "lookup dictionary" for new names
- The ordering of the labels in the index is by default taken from the numerical/alphabetical ordering but this is not what we want. To force the index order after sorting it, we must use pd.Categorical as the index. The order for sorting is given in the
categories
argument. We rely on the fact that in python 3.6+ dictionaries preserve ordering. - For some reason the
ser_out.sort_index()
did not work out even with a categorical index. I am thinking it might be a bug in the pandas. Therefore, the result Seriesser_out
is casted to a DataFramedf_out
, and the sorting is made using dataframe. - Lastly, the resulting dataframe is made MultiIndex with
set_index
.
Code
Zero-width bins cause the value_counts
to yield really bizarre results. Maybe this is a bug of pandas. Therefore, let's divide the problem into two steps (1) Count the data in the non-zero-width bins (2) Count the data in zero-width bins ("0%" and "100%")
import pandas as pd
import numpy as np
d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Tokyo','Lisbon','Lisbon','Lisbon','Tokyo','Lisbon','Tokyo','Tokyo','Tokyo','Lisbon','Tokyo','Tokyo','Lisbon','Lisbon','Lisbon','Tokyo','Lisbon','Tokyo'],
'Card': ['Visa','Visa','Master Card','Master Card','Visa','Master Card','Visa','Visa','Master Card','Visa','Master Card','Visa','Visa','Master Card','Master Card','Visa','Master Card','Visa','Visa','Master Card','Visa','Master Card'],
'Colateral':['Yes','No','Yes','No','No','No','No','Yes','Yes','No','Yes','Yes','No','Yes','No','No','No','Yes','Yes','No','No','No'],
'Client Number':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
'% Debt Paid':[0.8,0.1,0.5,0.30,0,0.2,0.4,1,0.60,1,0.5,0.2,0,0.3,0,0,0.2,0,0.1,0.70,0.5,0.1]}
df = pd.DataFrame(data=d)
def _get_binned_part(df, group_cols, debt_col):
bins = pd.IntervalIndex.from_breaks((0, 0.25, 0.5, 0.75, 1, np.inf),
closed='right')
gp = df[group_cols + [debt_col]].groupby(group_cols, sort=False)
ser_pt1 = gp[debt_col].value_counts(bins=bins, sort=False, normalize=True)
ser_pt1.index.set_names('bin', level=3, inplace=True)
return ser_pt1
def _get_non_binned_part(df, group_cols, debt_col):
# Count 0% and 100% occurences
ser_pt2 = df[df[debt_col].isin((0, 1))]\
.groupby(group_cols)[debt_col].value_counts()
# include zero counts
ser_pt2 = ser_pt2.reindex(pd.MultiIndex.from_product(
ser_pt2.index.levels, names=ser_pt2.index.names),
fill_value=0)
ser_pt2.index.set_names('bin', level=3, inplace=True)
# ser_counts has the counts for normalization.
ser_counts = df.groupby(group_cols)[debt_col].count()
ser_pt2 = ser_pt2 / ser_counts
return ser_pt2
def _rename_bins(ser_out, group_cols, debt_col):
bin_names = []
bin_name_dict = {
'0.0': '0%',
'(0.0, 0.25]': ']0, 25]%',
'(0.25, 0.5]': ']25, 50]%',
'(0.5, 0.75]': ']50, 75]%',
'(0.75, 1.0]': ']75, 100]%',
'1.0': '100%',
'(1.0, inf]': '>100%',
}
bin_order = list(bin_name_dict.values())
for val in ser_out.index.levels[3].values:
bin_names.append(bin_name_dict.get(val.__str__(), val.__str__()))
bin_categories = pd.Categorical(bin_names,
categories=bin_order,
ordered=True)
ser_out.index.set_levels(bin_categories, level=3, inplace=True)
# For some reason, .sort_index() does not sort correcly
# -> Make it a dataframe and sort there.
df_out = ser_out.reset_index()
df_out['bin'] = pd.Categorical(df_out['bin'].values,
bin_order,
ordered=True)
df_out = df_out.sort_values(group_cols + ['bin']).set_index(group_cols +
['bin'])
df_out.rename(columns={debt_col: 'in_bin'}, inplace=True)
df_out['in_bin'] = (df_out['in_bin'] * 100).round(2)
return df_out
def get_results(df):
group_cols = ['City', 'Card', 'Colateral']
debt_col = '% Debt Paid'
ser_pt1 = _get_binned_part(df, group_cols, debt_col)
ser_pt2 = _get_non_binned_part(df, group_cols, debt_col)
ser_out = pd.concat([ser_pt1, ser_pt2])
df_out = _rename_bins(ser_out, group_cols, debt_col)
return df_out
df_out = get_results(df)
Example output
In [1]: df_out
Out[1]:
in_bin
City Card Colateral bin
Lisbon Master Card No 0% 0.00
]0, 25]% 100.00
]25, 50]% 0.00
]50, 75]% 0.00
]75, 100]% 0.00
100% 0.00
>100% 0.00
Yes 0% 0.00
]0, 25]% 0.00
]25, 50]% 100.00
]50, 75]% 0.00
]75, 100]% 0.00
100% 0.00
>100% 0.00
Visa No 0% 0.00
]0, 25]% 0.00
]25, 50]% 66.67
]50, 75]% 0.00
]75, 100]% 33.33
100% 33.33
>100% 0.00
Yes 0% 33.33
]0, 25]% 33.33
]25, 50]% 0.00
]50, 75]% 0.00
]75, 100]% 33.33
100% 33.33
>100% 0.00
Tokyo Master Card No 0% 25.00
]0, 25]% 25.00
]25, 50]% 25.00
]50, 75]% 25.00
]75, 100]% 0.00
100% 0.00
>100% 0.00
Yes 0% 0.00
]0, 25]% 0.00
]25, 50]% 50.00
]50, 75]% 50.00
]75, 100]% 0.00
100% 0.00
>100% 0.00
Visa No 0% 75.00
]0, 25]% 25.00
]25, 50]% 0.00
]50, 75]% 0.00
]75, 100]% 0.00
100% 0.00
>100% 0.00
Yes 0% 0.00
]0, 25]% 50.00
]25, 50]% 0.00
]50, 75]% 0.00
]75, 100]% 50.00
100% 0.00
>100% 0.00
Appendix
Desired example output: "Lisbon, Visa, No"
With this combination
In [1]: df.loc[ (df['City'] == 'Lisbon') & (df['Card'] == 'Visa') & (df['Colateral'] == 'No')]
Out[1]:
City Card Colateral Client Number % Debt Paid
6 Lisbon Visa No 7 0.4
9 Lisbon Visa No 10 1.0
20 Lisbon Visa No 21 0.5
the output data table should have
0% 0%
]0, 25]% 0%
]25, 50]% 66.7%
]50, 75]% 0%
]75, 100]% 33.3%
100% 33.3%
>100% 0%
Note that the one intersecting bin pair (]75, 100]
and [100, 100]
) will cause the total sum of the ouput column to be sometimes greater than 100%.