How to sample from DataFrame based on percentile of a column?
Let's try:
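import pandas as pd

bins = [0, 0.1, 0.5, 1]   # cumulative-share bin edges
samples = [3, 3, 1]       # rows to draw from bin 0, 1, 2
# Accumulate percent from the bottom row up, then bin the running total.
df['sample'] = pd.cut(df.percent[::-1].cumsum(),
                      bins=bins,
                      labels=False      # integer bin index instead of an interval label
                      ).astype(int)
# Draw samples[i] rows from bin i.
df.groupby('sample').apply(lambda x: x.sample(n=samples[x['sample'].iloc[0]]))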
Output:
             key  freq   percent  sample
sample
1      0     ABC   100  0.328947       1
2      2     GHI    50  0.164474       2
       5     PQR    11  0.036184       2
4      7     VWX    10  0.032895       4
       6     STU    10  0.032895       4
       12   HAHA     1  0.003289       4
       10  HOWEE     2  0.006579       4
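For anyone who wants to run the binning idea end to end, here is a minimal self-contained sketch. The data is made up for illustration (it is not the frame from the question), and two things are my own additions rather than part of the answer above: the `clip` call, which guards against the cumulative sum drifting slightly past 1 through floating-point error, and the `min(..., len(group))` cap, which avoids an error when a bin holds fewer rows than requested. It also loops over the groups instead of using `apply`, since how `apply` treats the grouping column differs between pandas versions.

```python
import pandas as pd

# Hypothetical data (not from the original post), sorted by descending freq.
df = pd.DataFrame({
    "key": list("ABCDEFGHIJ"),
    "freq": [100, 60, 50, 30, 20, 15, 10, 8, 5, 2],
})
df["percent"] = df["freq"] / df["freq"].sum()

bins = [0, 0.1, 0.5, 1]
samples = [3, 3, 1]  # rows to draw from bins 0, 1, 2

# Cumulative share counted from the rarest key upward.
# clip() is an added guard: float error could push the last value above 1,
# which pd.cut would turn into NaN.
rev_cum = df["percent"][::-1].cumsum().clip(upper=1.0)
df["sample"] = pd.cut(rev_cum, bins=bins, labels=False).astype(int)

# Sample each bin separately; never ask for more rows than the bin holds.
parts = [
    group.sample(n=min(samples[label], len(group)), random_state=0)
    for label, group in df.groupby("sample")
]
result = pd.concat(parts)
print(result)
```

`random_state=0` only makes the draw repeatable; drop it if you want a different sample on each run.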
See if this helps.
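import pandas as pd

df = pd.DataFrame(rows)   # rows: the records from the question
df['percent'] = df['freq'] / df['freq'].sum()
# Share of the total still remaining before each row (1.0 for the first row).
df['cum_lag'] = 1 - df['percent'].cumsum().shift(fill_value=0)
# Keep rows where more than 50% of the total is still unaccounted for.
print(df[df['cum_lag'] > 0.5]['key'])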
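Here is a small runnable sketch of the same lagged-cumulative idea, in case it helps to see it on concrete numbers. The `rows` list is hypothetical (the real one comes from the question), it assumes the rows are already sorted by descending `freq`, and the final `sample` call is my own addition to show how you might draw from the selected slice.

```python
import pandas as pd

# Hypothetical records (not from the original post), sorted by descending freq.
rows = [
    {"key": "A", "freq": 40},
    {"key": "B", "freq": 30},
    {"key": "C", "freq": 20},
    {"key": "D", "freq": 10},
]
df = pd.DataFrame(rows)
df["percent"] = df["freq"] / df["freq"].sum()

# Fraction of the total not yet covered by the rows above this one.
df["cum_lag"] = 1 - df["percent"].cumsum().shift(fill_value=0)

# Rows that begin while more than half of the total is still uncovered.
top = df[df["cum_lag"] > 0.5]

# Draw one row from that slice; random_state only makes the draw repeatable.
picked = top.sample(n=1, random_state=0)
print(top["key"].tolist())
print(picked)
```

Changing the `0.5` threshold moves the cut-off: a larger value keeps fewer of the most frequent rows.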