Random row selection in Pandas dataframe
sample
As of v0.20.0, you can use pd.DataFrame.sample, which returns a random sample of either a fixed number of rows or a fraction of the rows:
df = df.sample(n=k) # k rows
df = df.sample(frac=k) # about len(df) * k rows, where k is a fraction such as 0.5
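To make the difference between n and frac concrete, here is a minimal sketch on a small made-up frame (the column name value and the 10-row size are just assumptions for illustration):

```python
import pandas as pd

# Hypothetical example frame with 10 rows
df = pd.DataFrame({"value": range(10)})

by_count = df.sample(n=3)            # a fixed number of rows: 3
by_fraction = df.sample(frac=0.5)    # a fraction of the rows: 5 of 10

print(len(by_count), len(by_fraction))
```

Sampling is without replacement by default, so no row appears twice in either result.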
For reproducibility, you can specify an integer random_state, which has a similar effect to seeding NumPy with np.random.seed. So, instead of calling, for example, np.random.seed(0), you can:
df = df.sample(n=k, random_state=0)
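A quick way to convince yourself that random_state makes the sample repeatable is to draw twice with the same seed and compare. This is only a sketch on a made-up 100-row frame:

```python
import pandas as pd

# Hypothetical example frame
df = pd.DataFrame({"value": range(100)})

first = df.sample(n=5, random_state=0)
second = df.sample(n=5, random_state=0)

# Same seed, so both draws contain the same rows in the same order
print(first.equals(second))
```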
Something like this?
import random
def some(x, n):
    # .loc replaces the deprecated .ix; list() gives random.sample a plain sequence
    return x.loc[random.sample(list(x.index), n)]
Note: As of Pandas v0.20.0, ix has been deprecated in favour of loc for label-based indexing.
One way to do this is with the sample function from the random module:
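import numpy as np
import pandas as pd
from random import sample
# given data frame df
# create random index (10 distinct row positions; range replaces Python 2's xrange)
rindex = np.array(sample(range(len(df)), 10))
# get 10 random rows from df by position (iloc replaces the deprecated ix)
dfr = df.iloc[rindex]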
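The same positional idea can also be written with a NumPy random generator instead of the random module. This is an alternative sketch, not the approach from the answer above; the frame, its string labels and the seed are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with non-integer labels, to show that positions are used
df = pd.DataFrame({"value": range(50)}, index=[f"row{i}" for i in range(50)])

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
# 10 distinct row positions drawn from 0 .. len(df) - 1
positions = rng.choice(len(df), size=10, replace=False)

# iloc selects by position, so the index labels do not matter here
dfr = df.iloc[positions]
print(dfr)
```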
With pandas version 0.16.1 and up, there is now a DataFrame.sample method built-in:
import numpy as np
import pandas
# pandas.np is no longer available in current pandas, so NumPy is imported directly
df = pandas.DataFrame(np.random.random(100))
# Randomly sample 70% of your dataframe
df_percent = df.sample(frac=0.7)
# Randomly sample 7 elements from your dataframe
df_elements = df.sample(n=7)
For either approach above, you can get the rest of the rows by doing:
df_rest = df.loc[~df.index.isin(df_percent.index)]
Per Pedram's comment, if you would like to get reproducible samples, pass the random_state parameter:
df_percent = df.sample(frac=0.7, random_state=42)
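Putting the sample and the "rest of the rows" steps together, a minimal end-to-end sketch could look like this. It assumes the index labels are unique (with duplicate labels the isin filter would drop every row sharing a sampled label), and the 100-row frame is made up:

```python
import pandas as pd

# Hypothetical example frame with a unique default index
df = pd.DataFrame({"value": range(100)})

# 70% sample, reproducible thanks to random_state
df_percent = df.sample(frac=0.7, random_state=42)

# Everything that was not sampled
df_rest = df.loc[~df.index.isin(df_percent.index)]

print(len(df_percent), len(df_rest))
```

With a unique index, df.drop(df_percent.index) should give the same remainder.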