Getting a random sample in Python dataframe by category

I have an imbalanced dataset and used the following code to balance it to 100 samples (rows) per class (label). The column activity holds my classes. Because the sampling is done with replacement, rows can be duplicated, so the code oversamples the minority classes and undersamples the majority classes. It should be applied to the training set only.

balanced_df = Pdf_train.groupby('activity', as_index=False, group_keys=False).apply(lambda s: s.sample(100, replace=True))
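For anyone who wants to try this on toy data, here is a minimal sketch. The frame pdf_train and its value column are made up for illustration, and it swaps the apply/lambda for the built-in groupby(...).sample method (available in pandas 1.1 and later), which should give the same kind of result.

```python
import pandas as pd

# Hypothetical, deliberately imbalanced training set; 'activity' holds the class labels.
pdf_train = pd.DataFrame({
    "activity": ["walk"] * 250 + ["run"] * 40 + ["sit"] * 5,
    "value": range(295),
})

# Draw 100 rows per class. replace=True lets small classes be oversampled
# (rows repeat), while large classes are simply undersampled.
balanced_df = pdf_train.groupby("activity").sample(
    n=100, replace=True, random_state=0
)

# Count the rows per class after balancing.
counts = balanced_df["activity"].value_counts()
print(counts)
```

The random_state argument is only there to make the draw repeatable; drop it if you want a fresh sample each run.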

You have to tell pandas you want to group by category with the groupby method.

df.groupby('category')['item'].apply(lambda s: s.sample(10))

If a category has fewer than ten items and you don't want to sample with replacement, cap the sample size at the size of the group:

df.groupby('category')['item'].apply(lambda s: s.sample(min(len(s), 10)))
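A small self-contained sketch of the capped version, assuming a made-up frame with one large and one small category (the column names match the snippet above, the data is hypothetical):

```python
import pandas as pd

# Hypothetical data: category 'a' has 25 items, category 'b' has only 3.
df = pd.DataFrame({
    "category": ["a"] * 25 + ["b"] * 3,
    "item": range(28),
})

# Take at most 10 items per category, without replacement.
# min(len(s), 10) stops sample() from raising on groups smaller than 10.
sampled = df.groupby("category")["item"].apply(
    lambda s: s.sample(min(len(s), 10), random_state=0)
)

# The result is indexed by (category, original row label),
# so the first index level can be used to count items per category.
per_category = sampled.groupby(level=0).size()
print(per_category)
```

Small categories are returned whole (in shuffled order) and no item appears twice, since nothing is drawn with replacement.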

Tags:

Pandas

Python 3.X
