R split data into 2 parts randomly

I am building on the answer by ExperimenteR, which appears robust. One issue, however, is that sample() with the prob argument draws each value independently, so the resulting proportions are random rather than exact. Take this for example:

>sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.7, 0.3))

With n = 100, you might expect the number of TRUE and FALSE values to be exactly 70 and 30, respectively. Often, this is not the case:

>table(sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.7, 0.3)))

 FALSE  TRUE 
    34    66 

That is fine if you don't need exact proportions. But if you would like exactly 70% and 30%, then do this instead:

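v <- rep(c(TRUE, FALSE), times = c(70, 30)) # 70 TRUE, 30 FALSE (assumes data has 100 rows)
ind <- sample(v) # shuffle the flags into a random order
data1 <- data[ind, ] # rows flagged TRUE (exactly 70)
data2 <- data[!ind, ] # remaining rows (exactly 30)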
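
The snippet above hard-codes 70 and 30, so it only fits a data frame with 100 rows. A minimal sketch of the same idea for any number of rows follows. The names frac and n_true, the example data frame, and the seed are my own assumptions for illustration, and round() is just one way to handle sizes that don't divide evenly:

```r
set.seed(42)                      # arbitrary seed, only so the run is repeatable
n <- 100
data <- data.frame(x = runif(n), y = rnorm(n))

frac <- 0.7                       # assumed share of rows for the first part
n_true <- round(frac * n)         # number of rows that go to data1

# Build exactly n_true TRUE flags and n - n_true FALSE flags, then shuffle them
ind <- sample(rep(c(TRUE, FALSE), times = c(n_true, n - n_true)))

data1 <- data[ind, ]              # exactly n_true rows
data2 <- data[!ind, ]             # the remaining n - n_true rows
```

Because the flags are shuffled rather than drawn with probabilities, nrow(data1) is always n_true, whatever the seed.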

Try

n <- 100
data <- data.frame(x=runif(n), y=rnorm(n))
ind <- sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.7, 0.3))
data1 <- data[ind, ]
data2 <- data[!ind, ]
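
With this probabilistic version, the number of TRUE flags is a binomial draw with size n and probability 0.7, so the split sizes vary from run to run. A rough sketch to see how much they vary, assuming 1000 repetitions and an arbitrary seed (the name counts is mine):

```r
set.seed(1)                       # arbitrary seed, only so the run is repeatable
n <- 100

# Repeat the probabilistic split many times and record how many TRUEs each gives
counts <- replicate(1000, sum(sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3))))

range(counts)                     # smallest and largest first-part sizes observed
mean(counts)                      # should sit close to 0.7 * n
```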

Tags:

Split

Random

R