Pandas: balancing data

This method randomly selects k elements from each class; a class with fewer than k elements is returned whole.

def sampling_k_elements(group, k=3):
    if len(group) < k:
        return group
    return group.sample(k)

balanced = df.groupby('class').apply(sampling_k_elements).reset_index(drop=True)
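As a quick check of the idea above, here is a minimal sketch on made-up data. The toy DataFrame, the helper name sample_up_to_k, and the seed argument are assumptions added for illustration. It iterates over the groups instead of calling apply, which sidesteps differences in how pandas versions treat the grouping column inside apply.

```python
import pandas as pd

# Hypothetical toy data: class sizes are 5, 3 and 1.
df = pd.DataFrame({
    "class": ["c1"] * 5 + ["c2"] * 3 + ["c3"],
    "val": range(9),
})

def sample_up_to_k(group, k=3, seed=0):
    """Return k random rows of a group, or the whole group if it is smaller."""
    return group.sample(n=min(len(group), k), random_state=seed)

# Iterating over a groupby yields (key, sub-DataFrame) pairs.
balanced = pd.concat(
    [sample_up_to_k(group) for _, group in df.groupby("class")]
).reset_index(drop=True)

# No class ends up with more than k rows.
counts = balanced["class"].value_counts().to_dict()
print(counts)
```

Passing a fixed seed makes the draw reproducible; drop it if a different sample on every run is wanted.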

"The following code undersamples unbalanced classes. It is verbose, sorry about that, but try it! The same approach also works for upsampling. Good luck!"

Import required sampling libraries

import pandas as pd  # needed later for pd.concat
from sklearn.utils import resample

Define the majority and minority class

df_minority9 = df[df['class']=='c9']
df_majority1 = df[df['class']=='c1']
df_majority2 = df[df['class']=='c2']
df_majority3 = df[df['class']=='c3']
df_majority4 = df[df['class']=='c4']
df_majority5 = df[df['class']=='c5']
df_majority6 = df[df['class']=='c6']
df_majority7 = df[df['class']=='c7']
df_majority8 = df[df['class']=='c8']

Undersample majority class

# replace=True samples with replacement, so each call returns exactly
# n_samples rows whether the class has more or fewer rows than that.
maj_class1 = resample(df_majority1, replace=True, n_samples=1324, random_state=123)
maj_class2 = resample(df_majority2, replace=True, n_samples=1324, random_state=123)
maj_class3 = resample(df_majority3, replace=True, n_samples=1324, random_state=123)
maj_class4 = resample(df_majority4, replace=True, n_samples=1324, random_state=123)
maj_class5 = resample(df_majority5, replace=True, n_samples=1324, random_state=123)
maj_class6 = resample(df_majority6, replace=True, n_samples=1324, random_state=123)
maj_class7 = resample(df_majority7, replace=True, n_samples=1324, random_state=123)
maj_class8 = resample(df_majority8, replace=True, n_samples=1324, random_state=123)

Combine minority class with undersampled majority class

df = pd.concat([df_minority9, maj_class1, maj_class2, maj_class3, maj_class4, maj_class5, maj_class6, maj_class7, maj_class8])

Display new balanced class counts

df['class'].value_counts()
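The per-class steps above can also be collapsed into a single call. The sketch below swaps sklearn's resample for pandas' own DataFrameGroupBy.sample (available since pandas 1.1), so it is a different tool applied to the same idea, not the code from the answer. The toy data and the name n_target are assumptions; the target size is taken from the smallest class instead of being hardcoded.

```python
import pandas as pd

# Hypothetical toy data: 'c1' is the largest class, 'c3' the smallest.
df = pd.DataFrame({
    "class": ["c1"] * 6 + ["c2"] * 4 + ["c3"] * 2,
    "val": range(12),
})

# Target size: the row count of the smallest class.
n_target = int(df["class"].value_counts().min())

# Draw n_target rows from every class in one call.
# replace=False is plain undersampling; replace=True would also allow a
# target larger than some classes (upsampling), at the cost of duplicates.
balanced = (
    df.groupby("class")
      .sample(n=n_target, replace=False, random_state=123)
      .reset_index(drop=True)
)

counts = balanced["class"].value_counts().to_dict()
print(counts)
```

Because no class name appears in the code, the same lines work unchanged when classes are added or removed.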

The above answer is correct, but note that the g above is not a pandas DataFrame, which is most likely what the user wants. It is a DataFrameGroupBy object. apply does not modify anything in place; it returns a new DataFrame. To see this, call head on g after the apply: as shown below, the result still comes from the original, unsampled data.

import pandas as pd
d = {'class':['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
     'val': [1,2,1,1,2,1,1,2,3,3]
    }

d = pd.DataFrame(d)
g = d.groupby('class')
g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
g.head()

  class  val
0    c1    1
1    c2    2
2    c1    1
3    c1    1
4    c2    2
5    c1    1
6    c1    1
7    c2    2
8    c3    3
9    c3    3
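The point can be checked directly by comparing types and lengths before and after the apply. This is a minimal sketch on the same toy data; the names n_min, result, g_type, and result_type are introduced here for illustration, and a fixed random_state is added only to make the run repeatable.

```python
import pandas as pd

d = pd.DataFrame({
    "class": ["c1", "c2", "c1", "c1", "c2", "c1", "c1", "c2", "c3", "c3"],
    "val": [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

g = d.groupby("class")
n_min = int(g.size().min())  # size of the smallest class

# apply returns a new object; g itself is left untouched.
result = g.apply(lambda x: x.sample(n_min, random_state=0))

g_type = type(g).__name__            # still a groupby object
result_type = type(result).__name__  # a regular DataFrame
print(g_type, result_type, len(d), len(result))
```

Recent pandas releases may emit a warning about apply operating on the grouping column; that does not affect the types or lengths being compared here.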

To fix this, you can either create a new variable or assign g to the result of the apply as shown below so that you get a Pandas DataFrame:

g = d.groupby('class')
g = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
g = g.reset_index(drop=True)  # flatten the per-group index that apply adds

Calling head now yields:

g.head()

  class  val
0    c1    1
1    c1    1
2    c2    2
3    c2    2
4    c3    3

Which is most likely what the user wants.

g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))

  class  val
0    c1    1
1    c1    1
2    c2    2
3    c2    2
4    c3    3
5    c3    3

Answers to your follow-up questions

The x in the lambda ends up being a dataframe that is the subset of df represented by the group. Each of these dataframes, one for each group, gets passed through this lambda.
g is the groupby object. I placed it in a named variable because I planned on using it twice. df.groupby('class').size() is an alternative way to do df['class'].value_counts(), but since I was going to group anyway, I might as well reuse the same groupby and call size on it to get the counts, which saves time.
Those numbers are the index values from df that go with the sampled rows. I added reset_index(drop=True) to get rid of them.
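The three points above can be verified on a small example. This is an illustrative sketch only: the toy data mirrors the earlier example, and the names group_types, sizes, counts, sampled, and clean are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["c1", "c2", "c1", "c1", "c2", "c1", "c1", "c2", "c3", "c3"],
    "val": [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})
g = df.groupby("class")

# 1. Each group handed to the lambda is itself a DataFrame.
group_types = {name: type(x).__name__ for name, x in g}

# 2. g.size() carries the same per-class counts as value_counts().
sizes = g.size().to_dict()
counts = df["class"].value_counts().to_dict()

# 3. Sampled rows keep their original index labels from df ...
sampled = df[df["class"] == "c1"].sample(2, random_state=0)
# ... until reset_index(drop=True) replaces them with 0, 1, ...
clean = sampled.reset_index(drop=True)

print(group_types)
print(sizes == counts)
print(list(sampled.index), list(clean.index))
```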
