Shuffle DataFrame rows
The idiomatic way to do this with Pandas is to use the .sample method of your data frame to sample all rows without replacement:
df.sample(frac=1)
The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means to return all rows (in random order).
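For illustration, here is a minimal sketch of that call. The example frame and its column names are made up, and the seed is an assumption on my part: random_state is an optional argument of .sample that makes the order repeatable.

```python
import pandas as pd

# Hypothetical example frame; the column names are illustrative only.
df = pd.DataFrame({"a": range(5), "b": list("vwxyz")})

# frac=1 returns every row exactly once, in random order.
# random_state is optional; fixing it makes the shuffle reproducible.
shuffled = df.sample(frac=1, random_state=0)

# Each row keeps its original index label, so sorting by the index
# recovers the original order.
print(shuffled)
print(shuffled.sort_index()["a"].tolist() == df["a"].tolist())
```

Note that df itself is left untouched; .sample returns a new frame.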
Note: If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.
df = df.sample(frac=1).reset_index(drop=True)
Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
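A small sketch of the difference drop=True makes, assuming a made-up frame with string index labels (the names and the seed are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical frame with non-default index labels.
df = pd.DataFrame({"a": [10, 20, 30, 40]}, index=["w", "x", "y", "z"])

# Without drop=True the old labels are kept in a new "index" column.
with_old_index = df.sample(frac=1, random_state=1).reset_index()

# With drop=True the old labels are discarded; rebinding df replaces
# the original frame with the shuffled, renumbered one.
df = df.sample(frac=1, random_state=1).reset_index(drop=True)

print(list(with_old_index.columns))
print(list(df.columns))
print(list(df.index))
```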
Follow-up note: Although the above operation may not look in-place, it should not leave a second copy of the data in memory. The name is rebound to a new object (by which I mean id(df_old) is not the same as id(df_new)), and the old frame can be freed once nothing refers to it any more. To check this, you could run a simple memory profiler. It reports memory usage at the end of each line, so a short-lived peak during the shuffle itself would not show up, but the usage after the shuffle line is essentially unchanged:
$ python3 -m memory_profiler .\test.py
Filename: .\test.py
Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
You can simply use sklearn for this:
from sklearn.utils import shuffle
df = shuffle(df)
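As a hedged sketch of the same call with a fixed seed: the frame below is made up, and random_state is an optional argument of sklearn.utils.shuffle that I add here only to make the result repeatable.

```python
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical frame whose values equal its index labels, so it is
# easy to see that rows and labels move together.
df = pd.DataFrame({"a": range(6)})

# shuffle returns a shuffled copy and leaves the input unchanged.
shuffled = shuffle(df, random_state=42)

print(shuffled.index.tolist())
print(df["a"].tolist())
```

If you want a fresh 0..n-1 index afterwards, chaining .reset_index(drop=True) works here too.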
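TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case:
np.random.shuffle(DataFrame.values)
DataFrame, under the hood, uses a NumPy ndarray as its data holder (you can check this in the DataFrame source code). So np.random.shuffle() shuffles the array along the first axis of a multi-dimensional array, that is, it reorders the rows, while the index of the DataFrame remains unshuffled. This relies on DataFrame.values returning a view of the data rather than a copy, which pandas does not guarantee (for example, a frame with mixed column dtypes returns a copy).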
Though, there are some points to consider.
- The function returns None. In case you want to keep a copy of the original object, you have to make it before you pass the object to the function.
- sklearn.utils.shuffle(), as user tj89 suggested, can designate random_state along with another option to control output. You may want that for dev purposes.
- sklearn.utils.shuffle() is faster. But it WILL SHUFFLE the axis info (index, column) of the DataFrame along with the ndarray it contains.
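To make the first point concrete, here is a minimal sketch on a plain ndarray (the array contents are made up). It only demonstrates the in-place behaviour and the None return value of np.random.shuffle, not the DataFrame case.

```python
import numpy as np

# Hypothetical 4x3 array; each row is easy to recognise.
nd = np.arange(12).reshape(4, 3)

# Keep a copy first: the shuffle below modifies nd itself.
original = nd.copy()

# Rows (the first axis) are reordered in place; nothing is returned.
result = np.random.shuffle(nd)

print(result)  # None
print(nd)
```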
Benchmark result between sklearn.utils.shuffle() and np.random.shuffle():

ndarray
nd = sklearn.utils.shuffle(nd)    0.10793248389381915 sec (8x faster)
np.random.shuffle(nd)             0.8897626010002568 sec

DataFrame
df = sklearn.utils.shuffle(df)    0.3183923360193148 sec (3x faster)
np.random.shuffle(df.values)      0.9357550159329548 sec
Conclusion: If it is okay for the axis info (index, column) to be shuffled along with the ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().
Code used:
import timeit

# Setup runs once per timeit call and is not included in the timing.
setup = '''
import numpy as np
import pandas as pd
import sklearn.utils
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''

# Each call returns the total time in seconds for 1000 runs.
timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)
You can shuffle the rows of a data frame by indexing with a shuffled index. For this, you can e.g. use np.random.permutation (but np.random.choice is also a possibility):
In [12]: df = pd.read_csv(StringIO(s), sep=r"\s+")
In [13]: df
Out[13]:
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]:
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2
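If you want to keep the index numbered sequentially as in your example, you can simply reset the index: df_shuffled.reset_index(drop=True). Note that reset_index numbers the rows from 0 and returns a new frame, so assign the result if you want to keep it.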
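Put together, the approach could look like the sketch below. It uses a cut-down version of the frame above; the seeded generator from np.random.default_rng is my assumption for reproducibility, and np.random.permutation works the same way.

```python
import numpy as np
import pandas as pd

# Cut-down version of the example frame, with its non-contiguous index.
df = pd.DataFrame(
    {"Col1": [1, 4, 7, 10, 13, 16], "Type": [1, 1, 2, 2, 3, 3]},
    index=[0, 1, 20, 21, 45, 46],
)

# A random arrangement of the positions 0..n-1 (seed chosen arbitrarily).
rng = np.random.default_rng(0)
order = rng.permutation(len(df))

# Positional indexing reorders the rows; index labels travel with them.
df_shuffled = df.iloc[order]

# Optionally discard the old labels and renumber from 0.
df_renumbered = df_shuffled.reset_index(drop=True)

print(df_shuffled)
print(df_renumbered)
```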