How to split data into balanced training and test sets in sklearn

You can use StratifiedShuffleSplit to create training and test sets with the same class proportions as the original dataset:

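import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 3], [3, 7], [2, 4], [4, 8]])
y = np.array([0, 1, 0, 1])

# In sklearn.model_selection, n_splits replaces the old n_iter argument,
# and the data is passed to split() rather than to the constructor.
stratSplit = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for train_idx, test_idx in stratSplit.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

print(X_train)
# e.g.
# [[3 7]
#  [2 4]]
print(y_train)
# e.g. [1 0] (one sample per class; which rows are picked may vary by sklearn version)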
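
To check that the class proportions really are preserved, here is a minimal sketch on a made-up imbalanced dataset. The 80/20 labels and the names X_demo, y_demo and splitter are assumptions for illustration only, and the sketch assumes sklearn 0.18 or later, where StratifiedShuffleSplit lives in sklearn.model_selection:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced data: 80 samples of class 0, 20 of class 1.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X_demo, y_demo))

# Count the samples per class in each part of the split.
train_counts = np.bincount(y_demo[train_idx])
test_counts = np.bincount(y_demo[test_idx])

print(train_counts.tolist(), test_counts.tolist())
# [60, 15] [20, 5]
```

Both parts keep the original 4:1 ratio. Note that stratification preserves the class proportions of the input; it does not rebalance an imbalanced dataset.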

Although Christian's suggestion is correct, technically train_test_split should give you stratified results by using the stratify param.

So you could do:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation before version 0.18
X_train, X_test, y_train, y_test = train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target)

The catch is that the stratify parameter is only available from sklearn version 0.17 onward.

From the documentation about the parameter stratify:

stratify : array-like or None (default is None)
    If not None, data is split in a stratified fashion, using this as the labels array.
    New in version 0.17: stratify splitting
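
As a quick sanity check of the stratify parameter, the following sketch runs train_test_split on made-up data and counts the classes on each side. The arrays Data and Target here are hypothetical stand-ins (90 samples of class 0, 10 of class 1), and the import assumes sklearn 0.18 or later:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for Data and Target: 90 samples of class 0, 10 of class 1.
Data = np.arange(200).reshape(100, 2)
Target = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    Data, Target, test_size=0.3, random_state=0, stratify=Target
)

# Per-class sample counts in the training and test labels.
train_counts = np.bincount(y_train).tolist()
test_counts = np.bincount(y_test).tolist()

print(train_counts)  # [63, 7]
print(test_counts)   # [27, 3]
```

Both sides keep the original 9:1 class ratio. Without stratify, the counts would depend on the random shuffle, and a rare class could end up underrepresented in one of the two sets.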
