Shuffle a pandas dataframe by groups
Assuming you want to shuffle by sampleID: first split the frame with df.groupby, shuffle the resulting list of groups (import random first), and then reassemble them with pd.concat:
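import random
import pandas as pd
# One sub-frame per sampleID; naming it `group` avoids shadowing the outer `df`
groups = [group for _, group in df.groupby('sampleID')]
random.shuffle(groups)  # reorders the list of groups in place
pd.concat(groups).reset_index(drop=True)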
   sampleID  col1  col2
0         2     1    20
1         2     2    94
2         2     3    99
3         1     1    63
4         1     2    23
5         1     3    73
6         3     1    73
7         3     2    56
8         3     3    34
The reset_index(drop=True) call is optional; it only replaces the original row labels with a fresh 0 to n-1 index.
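For reference, the steps above could be wrapped into a small self-contained function. This is only a sketch: the sample data, the helper name shuffle_groups, and the seed parameter are assumptions added for illustration, not part of the original answer.

```python
import random

import pandas as pd

# Hypothetical sample data: three groups of three rows each.
df = pd.DataFrame({
    "sampleID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "col1": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "col2": [63, 23, 73, 20, 94, 99, 73, 56, 34],
})


def shuffle_groups(frame, key, seed=None):
    """Return a copy of `frame` with whole groups of `key` in random order."""
    # A local RNG keeps the global random state untouched;
    # pass a seed to make the order reproducible.
    rng = random.Random(seed)
    groups = [group for _, group in frame.groupby(key)]
    rng.shuffle(groups)
    return pd.concat(groups).reset_index(drop=True)


shuffled = shuffle_groups(df, "sampleID", seed=0)
print(shuffled)
```

Rows inside each group keep their original relative order; only the order of the groups changes.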
I found this to be significantly faster than the accepted answer:
ids = df["sampleID"].unique()
random.shuffle(ids)
df = df.set_index("sampleID").loc[ids].reset_index()
For some reason, pd.concat was the bottleneck in my use case. Regardless, this approach avoids the concatenation entirely.
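A self-contained sketch of this index-based variant might look like the following. The sample data, the helper name shuffle_groups_by_index, and the seed parameter are hypothetical, and NumPy's Generator.permutation is swapped in for random.shuffle so that the array of IDs is not modified in place.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: three groups of three rows each.
df = pd.DataFrame({
    "sampleID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "col1": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "col2": [63, 23, 73, 20, 94, 99, 73, 56, 34],
})


def shuffle_groups_by_index(frame, key, seed=None):
    """Reorder whole groups of `key` via label lookup, without pd.concat."""
    rng = np.random.default_rng(seed)
    # permutation() returns a shuffled copy of the unique IDs.
    ids = rng.permutation(frame[key].unique())
    # .loc with a list of labels returns every row for each label,
    # in the order the labels are given.
    return frame.set_index(key).loc[ids].reset_index()


reordered = shuffle_groups_by_index(df, "sampleID", seed=0)
print(reordered)
```

Whether this is actually faster than the concat version will depend on the data, so it is worth timing both on your own frame.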