Python: Random selection per group

There are two ways to do this very simply, one without using anything except basic pandas syntax:

df[['x','y']].groupby('x').agg(pd.DataFrame.sample)

This takes 14.4 ms on a 50k-row dataset.

The other, slightly faster method uses NumPy:

df[['x','y']].groupby('x').agg(np.random.choice)

This takes 10.9 ms on the same 50k-row dataset.

Generally speaking, when using pandas it's preferable to stick with its native syntax, especially for beginners.
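If you want to check the timings on your own machine, here is a minimal sketch. The synthetic frame, the helper names (pick_numpy, pick_pandas, time_ms) and the choice of 1,000 groups are my assumptions, not the dataset measured above, so expect different numbers.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the 50k-row dataset: about 1,000 groups in 'x'.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.integers(0, 1000, size=50_000),
    "y": rng.random(50_000),
})

def pick_numpy(frame):
    # One randomly chosen 'y' per distinct 'x', via NumPy.
    return frame.groupby("x")["y"].agg(np.random.choice)

def pick_pandas(frame):
    # Same idea with pandas' own sampler; take the single sampled value.
    return frame.groupby("x")["y"].agg(lambda s: s.sample(1).iloc[0])

def time_ms(fn, repeats=5):
    # Average wall-clock time per call, in milliseconds.
    return 1000 * timeit.timeit(lambda: fn(df), number=repeats) / repeats

numpy_ms = time_ms(pick_numpy)
pandas_ms = time_ms(pick_pandas)
print(f"numpy: {numpy_ms:.1f} ms, pandas: {pandas_ms:.1f} ms")
```

Both helpers return one value per group, indexed by 'x'. The absolute timings depend heavily on the number of groups and on the pandas version.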


Using groupby and random.choice in an elegant one-liner:

import random
df.groupby('Group_Id').apply(lambda x: x.iloc[random.choice(range(len(x)))])
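For anyone who wants to run this end to end, here is a small self-contained sketch of the same idea. The sample data and the helper name pick_row are made up for illustration, and I select the Name column explicitly so the result looks the same regardless of how a given pandas version treats the grouping column inside apply.

```python
import random

import pandas as pd

# Hypothetical data shaped like the example in this thread.
df = pd.DataFrame({
    "Name": ["ABC", "DEF", "XYZ", "GHI", "JKL"],
    "Group_Id": [1, 1, 2, 2, 3],
})

def pick_row(group):
    # Choose one positional index uniformly at random and return that row.
    return group.iloc[random.randrange(len(group))]

# Selecting the value columns explicitly means pick_row only ever sees
# 'Name'; the group key ends up in the index of the result.
picked = df.groupby("Group_Id")[["Name"]].apply(pick_row)
print(picked)
```

The result has one row per Group_Id, with Group_Id as the index. random.randrange(len(group)) draws the same kind of position as random.choice(range(len(group))), just without building the range.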

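To draw more than one row per group, with or without replacement:

import numpy as np
size = 2        # sample size per group
replace = True  # sample with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace), :]
df.groupby('Group_Id', as_index=False).apply(fn)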
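One thing to watch with the snippet above: with replace = False, np.random.choice raises a ValueError for any group that has fewer rows than size. A possible workaround, sketched below, caps the sample at the group length. The helper name sample_rows, the sample data and the seeded generator are my own assumptions, and I loop over the groups with pd.concat instead of apply just to keep the behaviour identical across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data; group 3 has only one row.
df = pd.DataFrame({
    "Name": ["ABC", "DEF", "XYZ", "GHI", "JKL"],
    "Group_Id": [1, 1, 2, 2, 3],
})

def sample_rows(group, size, replace, rng):
    # Without replacement a group cannot supply more rows than it holds.
    k = size if replace else min(size, len(group))
    # Draw row positions rather than labels, so duplicate index labels are safe.
    positions = rng.choice(len(group), size=k, replace=replace)
    return group.iloc[positions]

rng = np.random.default_rng(42)  # seeded so the draw is repeatable

with_repl = pd.concat(
    [sample_rows(g, 2, True, rng) for _, g in df.groupby("Group_Id")]
)
without_repl = pd.concat(
    [sample_rows(g, 2, False, rng) for _, g in df.groupby("Group_Id")]
)
print(with_repl)
print(without_repl)
```

With replacement every group contributes exactly two rows (possibly the same row twice). Without replacement the single-row group contributes just its one row.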

From 0.16.x onwards, pd.DataFrame.sample provides a way to return a random sample of items from an axis of the object.

In [664]: df.groupby('Group_Id').apply(lambda x: x.sample(1)).reset_index(drop=True)
Out[664]:
  Name  Group_Id
0  ABC         1
1  XYZ         2
2  DEF         3
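On newer pandas you can skip apply entirely, since groupby objects have their own sample method. A minimal sketch, assuming pandas 1.1 or later; the data below is made up to resemble the output above, and random_state is only there to make the draw repeatable.

```python
import pandas as pd

# Hypothetical data shaped like the example in this thread.
df = pd.DataFrame({
    "Name": ["ABC", "DEF", "XYZ", "GHI", "JKL"],
    "Group_Id": [1, 1, 2, 2, 3],
})

# One random row per group; all original columns are kept.
picked = df.groupby("Group_Id").sample(n=1, random_state=0).reset_index(drop=True)
print(picked)
```

Drop random_state to get a different draw on each call, or pass frac instead of n to sample a fraction of each group.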