Parameter "stratify" from method "train_test_split" (scikit Learn)
This stratify
parameter makes a split so that the proportion of values in the sample produced will be the same as the proportion of values provided to parameter stratify
.
For example, if variable y
is a binary categorical variable with values 0
and 1
and there are 25% of zeros and 75% of ones, stratify=y
will make sure that your random split has 25% of 0
's and 75% of 1
's.
For my future self who comes here via Google:
train_test_split
is now in model_selection
, hence:
from sklearn.model_selection import train_test_split
# given:
# features: xs
# ground truth: ys
x_train, x_test, y_train, y_test = train_test_split(xs, ys,
test_size=0.33,
random_state=0,
stratify=ys)
is the way to use it. Setting the random_state
is desirable for reproducibility.