Hyper-parameter tuning using pure ranger package in R
Note that mlr disables the internal parallelization of ranger by default. Set the hyperparameter num.threads to the number of available cores to speed mlr up:
learner <- makeLearner("classif.ranger", num.threads = 4)
Alternatively, start a parallel backend via
parallelStartMulticore(4) # linux/osx
parallelStartSocket(4) # windows
before calling tuneParams to parallelize the tuning.
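To show where the backend calls sit relative to tuneParams, here is a minimal sketch. It assumes the mlr and parallelMap packages are installed; the iris task, the small mtry grid, and the choice of 2 workers are only illustrative. ranger is kept single-threaded here so that only the tuning loop runs in parallel.

```r
library(mlr)
library(parallelMap)

# Illustrative task and search space (not from the original question)
task <- makeClassifTask(id = "iris", data = iris, target = "Species")
learner <- makeLearner("classif.ranger", num.threads = 1)
rdesc <- makeResampleDesc("CV", iters = 5)
ps <- makeParamSet(makeIntegerParam("mtry", 3, 4))

# Start the backend before tuning; the socket backend also runs on linux/osx
parallelStartSocket(2)
res <- tuneParams(learner, task, rdesc, par.set = ps,
                  control = makeTuneControlGrid())
# Shut the workers down again once tuning has finished
parallelStop()

print(res$x)
```

Calling parallelStop() afterwards releases the worker processes, so later code runs sequentially again.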
I think there are at least two errors:
First, the function ranger does not have a parameter called training_data. Your error message Error in ranger(Species ~ ., training_data = iris, num.trees = 200) : unused argument (training_data = iris) refers to that. You can see that when you look at ?ranger or args(ranger).
Second, the function csrf, on the other hand, has training_data as input, but also requires test_data. Most importantly, these two arguments do not have any defaults, implying that you must provide them. The following works without problems:
library(ranger)
fit.rf = ranger(
Species ~ ., data = iris,
num.trees = 200
)
fit.rf.tune = csrf(
Species ~ .,
training_data = iris,
test_data = iris,
params1 = list(num.trees = 25, mtry=4),
params2 = list(num.trees = 50, mtry=4)
)
Here, I have just provided iris as both the training and the test dataset. You would obviously not want to do that in your real application. Moreover, note that ranger also takes num.trees and mtry as input, so you could try tuning it there.
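As a rough sketch of what separate training and test data could look like, the snippet below holds out a random 30% of iris for csrf. The split ratio, the seed, and the names test_idx, train, test, and accuracy are my own illustrative choices, not part of the question.

```r
library(ranger)

set.seed(42)
# Hold out 30% of the rows as test data (the ratio is an arbitrary choice)
test_idx <- sample(nrow(iris), size = round(0.3 * nrow(iris)))
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# csrf returns predictions for the rows of test_data
pred <- csrf(
  Species ~ .,
  training_data = train,
  test_data = test,
  params1 = list(num.trees = 25, mtry = 4),
  params2 = list(num.trees = 50, mtry = 4)
)

# Share of held-out rows that were predicted correctly
accuracy <- mean(as.character(pred) == as.character(test$Species))
print(accuracy)
```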
Another way to tune the model is to build a grid by hand and loop over it. There may be better ways to tune the model, but this is one more option.
hyper_grid <- expand.grid(
mtry = 1:4,
node_size = 1:3,
num.trees = seq(50,500,50),
OOB_RMSE = 0
)
system.time(
for(i in 1:nrow(hyper_grid)) {
# train model
rf <- ranger(
formula = Species ~ .,
data = iris,
num.trees = hyper_grid$num.trees[i],
mtry = hyper_grid$mtry[i],
min.node.size = hyper_grid$node_size[i],
importance = 'impurity')
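# add OOB error to grid; for classification, prediction.error is the OOB
# misclassification rate, so this column is its square root rather than a true RMSE
hyper_grid$OOB_RMSE[i] <- sqrt(rf$prediction.error)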
})
user system elapsed
3.17 0.19 1.36
nrow(hyper_grid) # 120 models
position = which.min(hyper_grid$OOB_RMSE)
head(hyper_grid[order(hyper_grid$OOB_RMSE),],5)
mtry node_size num.trees OOB_RMSE
6 2 2 50 0.1825741858
23 3 3 100 0.1825741858
3 3 1 50 0.2000000000
11 3 3 50 0.2000000000
14 2 1 100 0.2000000000
# fit best model
rf.model <- ranger(
Species ~ .,
data = iris,
num.trees = hyper_grid$num.trees[position],
importance = 'impurity',
probability = FALSE,
min.node.size = hyper_grid$node_size[position],
mtry = hyper_grid$mtry[position])
rf.model
Ranger result
Call:
ranger(Species ~ ., data = iris, num.trees = hyper_grid$num.trees[position], importance = "impurity", probability = FALSE, min.node.size = hyper_grid$node_size[position], mtry = hyper_grid$mtry[position])
Type: Classification
Number of trees: 50
Sample size: 150
Number of independent variables: 4
Mtry: 2
Target node size: 2
Variable importance mode: impurity
Splitrule: gini
OOB prediction error: 5.33 %
I hope this helps.
To answer my (unclear) question: apparently ranger has no built-in CV/GridSearch functionality. However, here's how you do hyper-parameter tuning with ranger (via a grid search) outside of caret. Thanks go to Marvin Wright (the maintainer of ranger) for the code. It turns out caret CV with ranger was slow for me because I was using the formula interface (which should be avoided).
ptm <- proc.time()
library(ranger)
library(mlr)
# Define task and learner
task <- makeClassifTask(id = "iris",
data = iris,
target = "Species")
learner <- makeLearner("classif.ranger")
# Choose resampling strategy and define grid
rdesc <- makeResampleDesc("CV", iters = 5)
ps <- makeParamSet(makeIntegerParam("mtry", 3, 4),
makeDiscreteParam("num.trees", 200))
# Tune
res = tuneParams(learner, task, rdesc, par.set = ps,
control = makeTuneControlGrid())
# Train on entire dataset (using best hyperparameters)
lrn = setHyperPars(makeLearner("classif.ranger"), par.vals = res$x)
m = train(lrn, task) # train on the task defined above
print(m)
print(proc.time() - ptm) # ~6 seconds
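If you want to stay with the ranger package alone, the same kind of grid search can be approximated with a hand-written cross-validation loop. This is only a sketch: the fold assignment, the grid values, and the names fold, grid, and cv_error are my own assumptions, and it reports the plain misclassification rate averaged over folds.

```r
library(ranger)

set.seed(1)
k <- 5
# Randomly assign each row to one of k folds
fold <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Same small grid as in the mlr example: mtry in {3, 4}, 200 trees
grid <- expand.grid(mtry = 3:4, num.trees = 200, cv_error = NA_real_)

for (i in seq_len(nrow(grid))) {
  fold_err <- numeric(k)
  for (f in seq_len(k)) {
    # Fit on all folds except f, using the non-formula interface
    fit <- ranger(
      dependent.variable.name = "Species",
      data = iris[fold != f, ],
      num.trees = grid$num.trees[i],
      mtry = grid$mtry[i]
    )
    # Misclassification rate on the held-out fold
    pred <- predict(fit, data = iris[fold == f, ])$predictions
    truth <- iris$Species[fold == f]
    fold_err[f] <- mean(as.character(pred) != as.character(truth))
  }
  grid$cv_error[i] <- mean(fold_err)
}

# Row of the grid with the lowest cross-validated error
best <- grid[which.min(grid$cv_error), ]
print(best)
```

On a dataset as small as iris the candidates often end up tied or very close, so the selected row can change with the seed.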
For the curious, the caret equivalent is
ptm <- proc.time()
library(caret)
data(iris)
# note: recent caret versions may also expect splitrule and min.node.size columns in this grid
grid <- expand.grid(mtry = c(3,4))
fitControl <- trainControl(method = "cv",
number = 5,
verboseIter = TRUE)
fit = train(
x = iris[ , names(iris) != 'Species'],
y = iris[ , names(iris) == 'Species'],
method = 'ranger',
num.trees = 200,
tuneGrid = grid,
trControl = fitControl
)
print(fit)
print(proc.time() - ptm) # ~2.4 seconds
Overall, in this comparison caret was the fastest way to do a grid search with ranger (~2.4 vs. ~6 seconds), provided one uses the non-formula interface.
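For reference, ranger itself can also be called without a formula. The sketch below assumes a reasonably recent ranger version that accepts x and y arguments; the names x, y, and fit_xy and the parameter values are just for illustration.

```r
library(ranger)

# Predictors and outcome passed separately, mirroring the caret call above
x <- iris[, names(iris) != "Species"]
y <- iris$Species

fit_xy <- ranger(x = x, y = y, num.trees = 200, mtry = 3, seed = 1)

# OOB misclassification rate of the fitted forest
print(fit_xy$prediction.error)
```

With older ranger versions, passing dependent.variable.name = "Species" together with data = iris is another way to avoid the formula.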