Randomly split a numpy array
The error is that randint samples with replacement, so it is giving some repeated indices. You can test it by printing len(set(ind)) and you will see it is smaller than 5000.
To use the same idea, simply replace the first line with
ind = np.random.choice(range(input_matrix.shape[0]), size=(5000,), replace=False)
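To see the difference between the two calls, here is a minimal sketch. The row count of 46928 and the seed are assumptions made only for illustration; any array with well over 5000 rows behaves the same way.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the run repeatable

n_rows = 46928  # assumed number of rows in input_matrix
n_pick = 5000

# randint draws with replacement, so some indices appear more than once
with_repeats = np.random.randint(0, n_rows, size=n_pick)

# choice with replace=False guarantees that every index is distinct
without_repeats = np.random.choice(n_rows, size=n_pick, replace=False)

print(len(set(with_repeats)))     # smaller than 5000
print(len(set(without_repeats)))  # 5000
```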
That being said, the second line of your code is pretty slow because of the iteration over the list. It would be much faster to define the indices you want with a vector of booleans, which would allow you to use the negation operator ~.
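# pick 5000 distinct row indices
choice = np.random.choice(range(input_matrix.shape[0]), size=(5000,), replace=False)
# boolean mask: True for the chosen rows
ind = np.zeros(input_matrix.shape[0], dtype=bool)
ind[choice] = True
# the complement selects the remaining rows
rest = ~ind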
On my machine, this method is just as fast as using scikit-learn's train_test_split, which makes me think that the two are doing much the same thing.
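The boolean-mask approach can be sketched end to end as follows. The small 100 x 2 x 2 array and the pick size of 10 are stand-ins chosen for illustration, not the sizes from the question.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the run repeatable

# small stand-in for the real input_matrix (assumed shape)
input_matrix = np.arange(100 * 2 * 2).reshape((100, 2, 2))
n_pick = 10

# distinct row indices, then a boolean mask marking the chosen rows
choice = np.random.choice(input_matrix.shape[0], size=n_pick, replace=False)
ind = np.zeros(input_matrix.shape[0], dtype=bool)
ind[choice] = True

# index with the mask and with its negation to get the two parts
picked = input_matrix[ind]
rest = input_matrix[~ind]

print(picked.shape)  # (10, 2, 2)
print(rest.shape)    # (90, 2, 2)
```

Note that indexing with a boolean mask returns the rows in their original order. If the selected rows should come out in the shuffled order, index with choice directly instead (input_matrix[choice]).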
One way is to use train_test_split from sklearn, as described in its documentation:
import numpy as np
from sklearn.model_selection import train_test_split
# creating matrix
input_matrix = np.arange(46928*28*28).reshape((46928,28,28))
print('Input shape: ', input_matrix.shape)
# split into two matrices; second_size is the fraction of rows that goes to the second one
second_size = 5000/46928
X1, X2 = train_test_split(input_matrix, test_size=second_size)
print('X1 shape: ', X1.shape)
print('X2 shape: ', X2.shape)
Result:
Input shape: (46928, 28, 28)
X1 shape: (41928, 28, 28)
X2 shape: (5000, 28, 28)
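As a variation, train_test_split also accepts an integer test_size, which it treats as an absolute number of samples, so the split does not depend on how a fractional size is rounded. A minimal sketch on a smaller array; the array shape and the random_state value are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# small stand-in for the real input matrix (assumed shape)
input_matrix = np.arange(1000 * 4).reshape((1000, 4))

# an integer test_size is the absolute number of rows in the second part;
# random_state is fixed only to make the run repeatable
X1, X2 = train_test_split(input_matrix, test_size=100, random_state=0)

print('X1 shape: ', X1.shape)
print('X2 shape: ', X2.shape)
```

With the sizes from the example above, the equivalent call would be train_test_split(input_matrix, test_size=5000).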