How can I speed up the training of my random forest?
The randomForest() function can accept data through either the "formula interface" or the "matrix interface". The matrix interface is known to deliver much better performance.
Formula interface:
library(randomForest)
rf.formula = randomForest(Species ~ ., data = iris)
Matrix interface:
rf.matrix = randomForest(y = iris[, 5], x = iris[, 1:4])
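If you want to see how much the interface choice matters for your own data, a rough timing sketch (here inflating the built-in iris data as a stand-in; the gap generally grows with the number of predictors, so a small example may only show a modest difference) could look like this:
library(randomForest)

# stack 200 copies of iris so the timing difference is measurable
big <- iris[rep(seq_len(nrow(iris)), 200), ]

system.time(randomForest(Species ~ ., data = big, ntree = 100))
system.time(randomForest(x = big[, 1:4], y = big[, 5], ntree = 100))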
While I'm a fan of brute force techniques, such as parallelization or running code for an extremely long time, I am an even bigger fan of improving an algorithm to avoid having to use brute force in the first place.
While training your random forest with 2000 trees was starting to get prohibitively expensive, training with a smaller number of trees took a more reasonable time. For starters, you can train with, say, 4, 8, 16, 32, ..., 256, 512 trees and carefully observe the metrics that tell you how robust the model is. These metrics include things like the best constant model (how well your forest performs on the data set versus a model which predicts the median for all inputs), as well as the out-of-bag error. In addition, you can observe the top predictors and their importance, and whether you start to see convergence there as you add more trees.
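A minimal sketch of that doubling loop, shown here on the built-in iris data (substitute your own predictors and response), might look like:
library(randomForest)

set.seed(1)
for (ntree in c(4, 8, 16, 32, 64, 128, 256, 512)) {
  fit <- randomForest(x = iris[, 1:4], y = iris[, 5], ntree = ntree)
  # report the final out-of-bag error rate for this forest size
  cat(ntree, "trees, OOB error:", fit$err.rate[ntree, "OOB"], "\n")
}
If the out-of-bag error has essentially flattened by 128 or 256 trees, the remaining doublings are unlikely to buy you anything.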
Ideally, you should not have to use thousands of trees to build a model. Once your model begins to converge, adding more trees won't necessarily worsen the model, but at the same time it won't add any new information. By not using more trees than you need, you may be able to cut a calculation that would have taken on the order of a week down to less than a day. If, on top of this, you leverage a dozen CPU cores, then you might be looking at something on the order of hours.
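If you do end up parallelizing, one common pattern (sketched here with the foreach and doParallel packages; the core count and tree split are placeholders to adjust for your machine) is to grow several smaller forests in parallel and merge them with randomForest::combine():
library(randomForest)
library(foreach)
library(doParallel)

registerDoParallel(cores = 4)  # adjust to the cores you actually have

# grow four sub-forests of 128 trees each in parallel, then merge them into one forest
rf.par <- foreach(ntree = rep(128, 4),
                  .combine = randomForest::combine,
                  .packages = "randomForest") %dopar% {
  randomForest(x = iris[, 1:4], y = iris[, 5], ntree = ntree)
}
One caveat: as far as I know, the merged object drops the combined out-of-bag error components, so do any OOB-based diagnostics on the individual sub-forests or on a single refit at your chosen size.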
To look at variable importance after each random forest run, you can try something along the lines of the following:
fit <- randomForest(...)
# importance() returns the variable importance matrix; round it for readability
round(importance(fit), 2)
It is my understanding that the top, say, 5-10 predictors have the greatest impact on the model. If you notice that increasing the number of trees doesn't really change the relative ordering of those top predictors, and their importance values seem to stay about the same, then you might want to consider not using so many trees.
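A rough sketch of that check (again on iris, using the default MeanDecreaseGini importance for classification):
library(randomForest)

set.seed(1)
fit.small <- randomForest(x = iris[, 1:4], y = iris[, 5], ntree = 64)
fit.large <- randomForest(x = iris[, 1:4], y = iris[, 5], ntree = 512)

# rank the predictors by importance in each forest and compare the orderings
rank.small <- names(sort(importance(fit.small)[, "MeanDecreaseGini"], decreasing = TRUE))
rank.large <- names(sort(importance(fit.large)[, "MeanDecreaseGini"], decreasing = TRUE))
cbind(rank.small, rank.large)
If the two columns list the predictors in the same order, the larger forest probably isn't telling you anything new about which variables matter.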