How to split data into trainset and testset randomly?
from sklearn.model_selection import train_test_split
import numpy
with open("datafile.txt", "rb") as f:
data = f.read().split('\n')
data = numpy.array(data) #convert array to numpy type array
x_train ,x_test = train_test_split(data,test_size=0.5) #test_size=0.5(whole_data)
To answer @desmond.carros question, I modified the best answer as follows,
import random
file=open("datafile.txt","r")
data=list()
for line in file:
data.append(line.split(#your preferred delimiter))
file.close()
random.shuffle(data)
train_data = data[:int((len(data)+1)*.80)] #Remaining 80% to training set
test_data = data[int((len(data)+1)*.80):] #Splits 20% data to test set
The code splits the entire dataset to 80% train and 20% test data
This can be done similarly in Python using lists, (note that the whole list is shuffled in place).
import random
with open("datafile.txt", "rb") as f:
data = f.read().split('\n')
random.shuffle(data)
train_data = data[:50]
test_data = data[50:]
You could also use numpy. When your data is stored in a numpy.ndarray:
import numpy as np
from random import sample
l = 100 #length of data
f = 50 #number of elements you need
indices = sample(range(l),f)
train_data = data[indices]
test_data = np.delete(data,indices)