Sample rows of pandas dataframe in proportion to counts in a column
The following samples a total of N rows, where each group appears in its original proportion rounded to the nearest integer, then shuffles the result and resets the index. Using:
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
B=range(20)
))
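Short and sweet (note that weights='A' weights each row by its value in column A rather than by the size of its group; leave weights out to sample rows uniformly, which preserves group proportions in expectation):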
df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)
Long version:
df.groupby('A', group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(df))))).sample(frac=1).reset_index(drop=True)
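To check the long version, here is a minimal sketch with N = 10 (an arbitrary value picked for illustration). It loops over the groups and concatenates the pieces instead of calling apply, which is just another way to write the same idea, and it fixes random_state only so the run is repeatable.

```python
import numpy as np
import pandas as pd

# Same frame as above: group sizes in A are 7, 6, 2 and 5.
df = pd.DataFrame(dict(
    A=[1] * 7 + [2] * 6 + [3] * 2 + [4] * 5,
    B=range(20),
))
N = 10  # illustrative total sample size, not from the original answer

# Draw round(N * group_size / total_size) rows from each group.
parts = [
    g.sample(n=int(np.rint(N * len(g) / len(df))), random_state=1)
    for _, g in df.groupby('A')
]

# Shuffle the combined rows and reset the index.
sample = pd.concat(parts).sample(frac=1, random_state=1).reset_index(drop=True)

# Rows drawn per group.
counts = sample['A'].value_counts().sort_index().to_dict()
print(counts)
```

The per-group targets here are 3.5, 3, 1 and 2.5 rows. np.rint rounds halves to the nearest even integer, so the groups contribute 4, 3, 1 and 2 rows. In general the rounded sizes are not guaranteed to add up to exactly N.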
I was looking for a similar solution. The code provided by @Vaishali works fine. What @Abdou is trying to do also makes sense when we want the sample drawn from each group to depend on the group's share of the full data.
# original : 10% from each group
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))
# modified : sample size based on proportions of group size
n = df.shape[0]
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=len(x)/n))
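To see what the modified version does to the sample sizes, here is a small sketch on a made-up frame (the column name value and the group sizes are assumptions for illustration). It uses a loop over the groups rather than apply, and random_state only for repeatability.

```python
import pandas as pd

# Hypothetical frame: group sizes 7, 6, 2 and 5 (20 rows in total).
df = pd.DataFrame({
    'group_id': [1] * 7 + [2] * 6 + [3] * 2 + [4] * 5,
    'value': range(20),
})
n = df.shape[0]

# Each group is sampled with frac equal to its share of the full data.
parts = [
    g.sample(frac=len(g) / n, random_state=0)
    for _, g in df.groupby('group_id')
]
sample_df = pd.concat(parts)

# Rows drawn per group; groups that end up with zero rows do not appear.
sizes = sample_df['group_id'].value_counts().sort_index().to_dict()
print(sizes)
```

Because frac is the group's share, each group contributes roughly len(x)**2 / n rows, so bigger groups are sampled at a higher rate. Very small groups can vanish from the result: the 2-row group here has a target of 0.2 rows, which rounds to 0.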
You can use groupby and sample:
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))
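As a side note, recent pandas versions (1.1 and later) also have a sample method directly on the groupby object, which avoids apply. A minimal sketch on a made-up frame (the group sizes and the column name value are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame: groups 1, 2 and 3 with 10, 20 and 30 rows.
df = pd.DataFrame({
    'group_id': [1] * 10 + [2] * 20 + [3] * 30,
    'value': range(60),
})

# Take 10% of the rows from every group; original index labels are kept.
sample_df = df.groupby('group_id').sample(frac=0.1, random_state=0)

# Rows drawn per group.
counts = sample_df['group_id'].value_counts().sort_index().to_dict()
print(counts)
```

Sampling the same fraction from every group keeps the groups in their original proportions, up to rounding of each group's sample size.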