scikit learn: train_test_split, can I ensure same splits on different datasets

Since sklearn.model_selection.train_test_split(*arrays, **options) accepts any number of arrays, you can pass both datasets in a single call, and every array is split with the same indices:

A_train, A_test, B_train, B_test, _, _ =  train_test_split(A, B, y, 
                                                           test_size=0.33,
                                                           random_state=42)
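
As a minimal runnable sketch of the call above, assuming two made-up feature arrays A and B that describe the same samples in the same row order (the shapes and the row-id column are only there for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: two feature sets describing the same 12 samples.
# Column 0 of each array stores the row id so alignment can be checked.
ids = np.arange(12)
A = np.column_stack((ids, ids * 10))
B = np.column_stack((ids, ids * 100, ids * 1000))
y = ids % 2

# One call splits every array with the same shuffled indices.
A_train, A_test, B_train, B_test, y_train, y_test = train_test_split(
    A, B, y, test_size=0.33, random_state=42
)

# The row ids line up across both datasets.
same_train_rows = np.array_equal(A_train[:, 0], B_train[:, 0])
same_test_rows = np.array_equal(A_test[:, 0], B_test[:, 0])
print(same_train_rows, same_test_rows)
```

All arrays passed in one call must have the same number of rows, otherwise train_test_split raises an error.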

Yes, passing the same random_state is enough, as long as both datasets have the same number of rows in the same order.

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X2 = np.hstack((X, X))
>>> X_train, X_test, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train2, X_test2, _, _ = train_test_split(X2, y, test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> X_train2
array([[4, 5, 4, 5],
       [0, 1, 0, 1],
       [6, 7, 6, 7]])
>>> X_test
array([[2, 3],
       [8, 9]])
>>> X_test2
array([[2, 3, 2, 3],
       [8, 9, 8, 9]])
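
One caveat worth checking: this relies on random_state being an integer. The sketch below, with an illustrative index array standing in for a dataset, compares an integer seed against a shared np.random.RandomState instance, which keeps advancing between calls and so is not expected to reproduce the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

idx = np.arange(200)  # stand-in for the row indices of any dataset

# Integer seed: each call builds a fresh generator, so the splits match.
train_a, test_a = train_test_split(idx, test_size=0.33, random_state=42)
train_b, test_b = train_test_split(idx, test_size=0.33, random_state=42)
int_seed_matches = np.array_equal(test_a, test_b)

# Shared RandomState instance: the second call continues from where the
# first one stopped, so the two splits are expected to differ.
rng = np.random.RandomState(42)
train_c, test_c = train_test_split(idx, test_size=0.33, random_state=rng)
train_d, test_d = train_test_split(idx, test_size=0.33, random_state=rng)
instance_matches = np.array_equal(test_c, test_d)

print(int_seed_matches, instance_matches)
```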

Looking at the code for train_test_split, an integer random_state is used to seed a new random number generator on every call, so the same seed gives the same split each time for inputs of the same length. We can check this quite simply:

import numpy as np
from sklearn import model_selection

X1 = np.random.random((200, 5))
X2 = np.random.random((200, 5))
y = np.arange(200)  # unique labels, so equal y means equal row indices

X1_train, X1_test, y1_train, y1_test = model_selection.train_test_split(X1, y,
                                                                        test_size=0.1,
                                                                        random_state=42)
# Split the second dataset (X2, not X1 again) with the same seed
X2_train, X2_test, y2_train, y2_test = model_selection.train_test_split(X2, y,
                                                                        test_size=0.1,
                                                                        random_state=42)

print(np.all(y1_train == y2_train))
print(np.all(y1_test == y2_test))

Which outputs:

True
True

Which is good! Another way to approach this is to make a single train/test split on all your features together and then separate the features before training. However, if you're in an unusual situation where you need the split itself rather than the split arrays (for example, with similarity matrices, where you don't want test samples showing up in your training set), you can use the StratifiedShuffleSplit class, whose split method returns the indices of the rows that belong to each set. For example:

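n_splits = 1
sss = model_selection.StratifiedShuffleSplit(n_splits=n_splits,
                                             test_size=0.1,
                                             random_state=42)
# y must hold class labels here (at least two samples per class);
# X1 is only used for its number of rows
train_idx, test_idx = list(sss.split(X1, y))[0]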
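
The returned indices can then be applied to as many arrays as needed. A minimal sketch, assuming hypothetical arrays X1 and X2 with the same row order and made-up binary class labels y:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.RandomState(0)
X1 = rng.random_sample((200, 5))   # hypothetical first feature set
X2 = rng.random_sample((200, 3))   # hypothetical second feature set
y = np.arange(200) % 2             # hypothetical binary class labels

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, test_idx = next(sss.split(X1, y))

# Index every dataset with the same index arrays.
X1_train, X1_test = X1[train_idx], X1[test_idx]
X2_train, X2_test = X2[train_idx], X2[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

If the labels are not class labels (a regression target, say), ShuffleSplit from the same module has the same split interface without the stratification requirement.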
