Warning message: "missing values in resampled performance measures" in caret train() using rpart
Not definitively sure without more data.
If this is regression, the most likely case is that the tree did not find a good split and used the average of the outcome as the predictor. That's fine but you cannot calculate R^2 since the variance of the predictions is zero.
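To make the regression case concrete, here is a minimal base R sketch. It assumes Rsquared is computed as the squared correlation between observed and predicted values, which is how I understand caret's default summary to work. The numbers are made up.

# Made-up outcome values for one resample
obs <- c(1.2, 3.4, 2.2, 5.1)

# A tree with no splits predicts the mean of the outcome for every row
pred <- rep(mean(obs), length(obs))

# Correlation is undefined when one side has zero variance,
# so R returns NA (with a warning, suppressed here)
r2 <- suppressWarnings(cor(obs, pred)^2)
is.na(r2)

# RMSE can still be computed from the same constant predictions
rmse <- sqrt(mean((obs - pred)^2))
is.finite(rmse)

So a constant prediction is enough to produce an NA in the Rsquared column even though nothing else went wrong.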
If this is classification, it's harder to say. You could have a resample where one of the outcome classes has zero samples, so sensitivity or specificity is undefined and is therefore reported as NA.
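The classification case can be sketched the same way in base R. The class labels and vectors below are invented for illustration, and the sensitivity and specificity are computed by hand from the confusion table rather than through caret.

# Hypothetical held-out fold in which the "yes" class never occurs
lv   <- c("yes", "no")
obs  <- factor(c("no", "no", "no", "no"), levels = lv)
pred <- factor(c("no", "yes", "no", "no"), levels = lv)

tab <- table(pred, obs)

# Sensitivity = true "yes" predictions / all observed "yes" = 0 / 0
sens <- tab["yes", "yes"] / sum(tab[, "yes"])
is.nan(sens)

# Specificity is still defined because "no" does occur in the fold
spec <- tab["no", "no"] / sum(tab[, "no"])
spec

With a rare class and many folds, it is easy for at least one resample to end up like this.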
The Problem
The problem is that the model is a tree-based algorithm, and some implementations can only handle a limited number of levels in a categorical feature. So you may have a variable that has been set to a factor with more than 53 categories. Calling randomForest directly shows the limit:
> rf.1 <- randomForest(x = rf.train.2,
+ y = rf.label,
+ ntree = 1000)
Error in randomForest.default(x = rf.train.2, y = rf.label, ntree = 1000) :
Can not handle categorical predictors with more than 53 categories.
Underneath, caret is running that function, so make sure you fix any categorical variables with more than 53 levels.
Here is where my problem lay before (notice zipcode coming in as a factor):
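A quick way to find the offending columns is to count the levels of every factor. This is only a sketch: the data frame df and its columns are hypothetical stand-ins for your own training data.

# Hypothetical data: zipcode has 60 levels, colour has 3
df <- data.frame(
  zipcode = factor(sprintf("%05d", 10001:10060)),
  colour  = factor(rep(c("red", "green", "blue"), 20)),
  price   = seq(10, 600, by = 10)
)

# Number of levels per factor column (NA for non-factor columns)
n_levels <- vapply(
  df,
  function(col) if (is.factor(col)) nlevels(col) else NA_integer_,
  integer(1)
)

# Names of the factor columns above the 53-level limit
too_many <- names(n_levels)[!is.na(n_levels) & n_levels > 53]
too_many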
# ------------------------------- #
# RANDOM FOREST WITH CV 10 FOLDS #
# ------------------------------- #
rf.train.2 <- df_train[, c("v1",
                           "v2",
                           "v3",
                           "v4",
                           "v5",
                           "v6",
                           "v7",
                           "v8",
                           "zipcode",
                           "price",
                           "made_purchase")]
rf.train.2 <- data.frame(v1 = as.factor(rf.train.2$v1),
                         v2 = as.factor(rf.train.2$v2),
                         v3 = as.factor(rf.train.2$v3),
                         v4 = as.factor(rf.train.2$v4),
                         v5 = as.factor(rf.train.2$v5),
                         v6 = as.factor(rf.train.2$v6),
                         v7 = as.factor(rf.train.2$v7),
                         v8 = as.factor(rf.train.2$v8),
                         zipcode = as.factor(rf.train.2$zipcode),
                         price = rf.train.2$price,
                         made_purchase = as.factor(rf.train.2$made_purchase))
rf.label <- rf.train.2[, "made_purchase"]
The Solution
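Remove all categorical variables that have more than 53 levels, or convert them to a non-factor type.
Here is my fixed-up code, which leaves zipcode as it is instead of converting it to a factor. You could also wrap it in a numeric conversion like this: as.numeric(rf.train.2$zipcode). Be careful if zipcode is already a factor at that point: as.numeric() then returns the internal level codes rather than the zip values, so use as.numeric(as.character(rf.train.2$zipcode)) instead.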
# ------------------------------- #
# RANDOM FOREST WITH CV 10 FOLDS #
# ------------------------------- #
rf.train.2 <- df_train[, c("v1",
                           "v2",
                           "v3",
                           "v4",
                           "v5",
                           "v6",
                           "v7",
                           "v8",
                           "zipcode",
                           "price",
                           "made_purchase")]
rf.train.2 <- data.frame(v1 = as.factor(rf.train.2$v1),
                         v2 = as.factor(rf.train.2$v2),
                         v3 = as.factor(rf.train.2$v3),
                         v4 = as.factor(rf.train.2$v4),
                         v5 = as.factor(rf.train.2$v5),
                         v6 = as.factor(rf.train.2$v6),
                         v7 = as.factor(rf.train.2$v7),
                         v8 = as.factor(rf.train.2$v8),
                         zipcode = rf.train.2$zipcode,
                         price = rf.train.2$price,
                         made_purchase = as.factor(rf.train.2$made_purchase))
rf.label <- rf.train.2[, "made_purchase"]
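One caveat on the as.numeric() idea, shown as a small base R sketch with made-up zip codes: if the column is already a factor, as.numeric() gives you the level codes, not the original values.

# Made-up zip codes stored as a factor
zip <- factor(c("90210", "10001", "60614"))

# Converts the internal level codes (position in the sorted levels)
codes <- as.numeric(zip)
codes

# Going through character first recovers the actual zip values
values <- as.numeric(as.character(zip))
values

Either version gets you under the 53-level limit, but both make the model treat zipcode as an ordered number, which may or may not be what you want.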
This warning appears when the model didn't converge in some cross-validation folds, so the predictions in those folds are missing or have zero variance. As a result, metrics like Rsquared (and RMSE, if the predictions themselves are NA) can't be calculated, so they become NA. Sometimes there are parameters you can tune for better convergence, e.g. the neuralnet package lets you increase threshold, which almost always leads to convergence. I'm not sure about the equivalent for the rpart package, though.
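For rpart the closest thing I can think of is the set of stopping parameters in rpart.control(), such as minsplit and cp. The sketch below uses a tiny made-up data set to show a tree that refuses to split under the defaults and does split once they are loosened. Treat the values as illustrative, not as recommended settings.

library(rpart)

# Tiny made-up data set: 15 rows, fewer than the default minsplit of 20
d <- data.frame(x = 1:15, y = 2 * (1:15))

# Default settings: the root node is too small to split,
# so every prediction is the mean of y
fit_default <- rpart(y ~ x, data = d)
n_nodes_default <- nrow(fit_default$frame)
n_distinct_default <- length(unique(predict(fit_default)))

# Looser stopping rules let the tree make splits
fit_loose <- rpart(y ~ x, data = d,
                   control = rpart.control(minsplit = 2, cp = 0))
n_nodes_loose <- nrow(fit_loose$frame)

c(default = n_nodes_default, loose = n_nodes_loose)

With caret's method = "rpart" the tuned parameter is cp, so widening the cp grid toward smaller values is the analogous step there.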
Another possible reason is that your training data already contains NAs. The obvious cure is then to remove them before passing the data to train(), e.g. train(data = na.omit(training.data)).
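Before dropping rows it helps to see where the NAs are. A minimal base R sketch, with training.data replaced by a made-up data frame:

# Made-up training data with missing values in both columns
training.data <- data.frame(
  x = c(1, 2, NA, 4),
  y = c(2.1, NA, 3.3, 4.8)
)

# Count of NAs per column
na_per_column <- colSums(is.na(training.data))
na_per_column

# Keep only rows with no missing values
clean <- na.omit(training.data)
nrow(clean)

# The cleaned data frame is what you would then hand to train(), e.g.
# fit <- caret::train(y ~ x, data = clean, method = "rpart")

If na.omit() throws away most of your rows, imputing the missing values may be a better option than dropping them.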
Hope that enlightens a bit.