Getting a random sample in Python dataframe by category
I have an imbalanced dataset, and I used the following code to balance it to 100 samples (rows) per class (label), sampling with replacement so that rows can be duplicated.
activity is the column that holds my class labels.
This code oversamples the minority classes and undersamples the majority classes. It should be used only on the training set.
balanced_df = Pdf_train.groupby('activity', as_index=False, group_keys=False).apply(lambda s: s.sample(100, replace=True))
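As a rough sketch of the same idea, pandas (1.1 and later) also offers a sample method directly on the groupby object, which avoids the apply/lambda step. The toy frame, the name pdf_train, the column value, and the target of 3 rows per class are all assumptions made for illustration; they are not from the original data.

```python
import pandas as pd

# Hypothetical toy training set; 'activity' holds the class label.
pdf_train = pd.DataFrame({
    "activity": ["walk"] * 5 + ["run"] * 2 + ["sit"] * 1,
    "value": range(8),
})

n_per_class = 3  # assumed target size per class (100 in the answer above)

# Draw n_per_class rows from every class. replace=True allows classes
# with fewer than n_per_class rows to be oversampled with duplicates.
balanced_df = pdf_train.groupby("activity").sample(
    n=n_per_class, replace=True, random_state=0
)

# Every class now has exactly n_per_class rows.
counts = balanced_df["activity"].value_counts().to_dict()
print(counts)
```

random_state is fixed here only to make the draw repeatable; drop it if you want a different sample on each run.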
You have to tell pandas that you want to group by category, using the groupby method.
df.groupby('category')['item'].apply(lambda s: s.sample(10))
If some categories have fewer than ten items and you don't want to sample with replacement, you can cap the sample size at the size of each group:
df.groupby('category')['item'].apply(lambda s: s.sample(min(len(s), 10)))
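A small self-contained sketch of the capped, no-replacement approach, in case it helps to see it run end to end. The data, the cap of 3, and the names max_items, sampled and sizes are made up for the example. It samples each group in a loop and concatenates the pieces, so whole rows (including the category column) are kept.

```python
import pandas as pd

# Hypothetical data: category 'a' has 4 items, category 'b' only 2.
df = pd.DataFrame({
    "category": ["a"] * 4 + ["b"] * 2,
    "item": ["a1", "a2", "a3", "a4", "b1", "b2"],
})

max_items = 3  # assumed cap per category (10 in the answer above)

# Sample each category without replacement, taking at most max_items rows
# and never more rows than the category actually has.
sampled = pd.concat(
    [
        group.sample(n=min(len(group), max_items), random_state=0)
        for _, group in df.groupby("category")
    ]
)

# 'a' is cut down to max_items rows; 'b' keeps both of its rows.
sizes = sampled.groupby("category").size().to_dict()
print(sizes)
```

Because nothing is drawn with replacement, no row appears twice in the result; small categories simply stay smaller than the cap.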