Partitioning a data set in R based on multiple classes of observations

This may be longer, but I think it's more intuitive, and it can be done in base R ;)

# create the data frame you've described
x <-
    data.frame(
        cl = 
            c( 
                rep( 'A' , 100 ) ,
                rep( 'B' , 100 ) ,
                rep( 'C' , 100 ) ,
                rep( 'D' , 100 ) 
            ) ,

        othernum1 = rnorm( 400 ) ,
        othernum2 = rnorm( 400 ) ,
        othernum3 = rnorm( 400 ) ,
        othernum4 = rnorm( 400 ) ,
        othernum5 = rnorm( 400 ) ,
        othernum6 = rnorm( 400 ) ,
        othernum7 = rnorm( 400 ) 
    )

# sample 67 training rows within classification groups
training.rows <-
    tapply( 
        # numeric vector containing the numbers
        # 1 to nrow( x )
        1:nrow( x ) , 

        # break the sample function out by
        # the classification variable
        x$cl , 

        # use the sample function within
        # each classification variable group
        sample , 

        # send the size = 67 parameter
        # through to the sample() function
        size = 67 
    )

# convert your list back to a numeric vector
tr <- unlist( training.rows )

# split your original data frame into two:

# all the records sampled as training rows
training.df <- x[ tr , ]

# all other records (NOT sampled as training rows)
testing.df <- x[ -tr , ]
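
A quick way to convince yourself the split worked is to count rows per class in each piece. The sketch below repeats the steps above in compact form so it runs on its own; the `set.seed()` call and its value are my additions for reproducibility, not part of the original answer.

```r
# fix the random seed so the split is reproducible (arbitrary value, an assumption)
set.seed( 42 )

# a smaller version of the example data frame: four classes of 100 rows each
x <- data.frame( cl = rep( c( 'A' , 'B' , 'C' , 'D' ) , each = 100 ) , othernum1 = rnorm( 400 ) )

# sample 67 row numbers within each class, then flatten to one vector
tr <- unlist( tapply( seq_len( nrow( x ) ) , x$cl , sample , size = 67 ) )

training.df <- x[ tr , ]
testing.df <- x[ -tr , ]

# rows per class in each piece: expect 67 per class in training, 33 in testing
table( training.df$cl )
table( testing.df$cl )

# the two pieces should not share any rows and should cover the whole data frame
stopifnot( !any( rownames( training.df ) %in% rownames( testing.df ) ) )
stopifnot( nrow( training.df ) + nrow( testing.df ) == nrow( x ) )
```

Because the sampling happens separately inside each class, every class contributes exactly 67 rows to the training set no matter how the random draw falls.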

There is also a nice package, caret, for dealing with machine learning problems. It contains a function, createDataPartition(), that does pretty much this: it samples 2/3 of the rows from each level of a supplied factor:

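# 2/3 of each factor level for training, the rest for testing
# (df and yourFactor stand in for your own data frame and class column)
library(caret)
inTrain <- createDataPartition(df$yourFactor, p = 2/3, list = FALSE)
dfTrain <- df[inTrain, ]
dfTest <- df[-inTrain, ]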
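
If the classes are not all the same size, a fixed count such as 67 no longer gives 2/3 of each one. Here is a minimal base R sketch of sampling a proportion within each class instead. The helper name `stratified.rows` and the example data are hypothetical, and this is not caret's implementation, so the exact row counts it picks may differ from what createDataPartition() returns.

```r
# hypothetical helper: return row numbers holding roughly a share p of each class
stratified.rows <- function( cl , p = 2/3 ) {
    # row numbers grouped by class
    idx <- split( seq_along( cl ) , cl )

    # within each class, draw round( p * class size ) positions;
    # sample.int() avoids the surprise sample() gives on a length-one vector
    picked <- lapply( idx , function( i ) i[ sample.int( length( i ) , size = round( p * length( i ) ) ) ] )

    unlist( picked , use.names = FALSE )
}

# example data with unequal class sizes (made up for illustration)
set.seed( 1 )
df <- data.frame( yourFactor = rep( c( 'A' , 'B' , 'C' ) , times = c( 30 , 60 , 90 ) ) , othernum1 = rnorm( 180 ) )

inTrain <- stratified.rows( df$yourFactor , p = 2/3 )
dfTrain <- df[ inTrain , ]
dfTest <- df[ -inTrain , ]

# rows per class in the training set
table( dfTrain$yourFactor )
```

With these class sizes the training set holds 20, 40 and 60 rows for A, B and C, and the remaining 10, 20 and 30 rows land in the test set.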
