Improving model training speed in caret (R)
What people often forget when comparing the underlying model to caret is that caret does a lot of extra work on top of the model fit.
Take your random forest as an example: bootstrap resampling with number 3 and a tuneLength of 5. You resample 3 times, and because of the tuneLength caret tries 5 candidate values of mtry. In total you fit 3 × 5 = 15 random forests and compare them to pick the best one for the final model, versus only 1 if you call randomForest directly (see the sketch below).
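To make the arithmetic concrete, here is a minimal sketch; `train_df` and its factor response `y` are hypothetical stand-ins for your own data:

```r
library(caret)
library(randomForest)

## 3 bootstrap resamples x 5 candidate mtry values = 15 forests,
## plus one final fit on the full training set.
ctrl <- trainControl(method = "boot", number = 3)

fit_caret <- train(y ~ ., data = train_df,
                   method = "rf",
                   trControl = ctrl,
                   tuneLength = 5)

## The plain call fits exactly one forest with the default mtry.
fit_rf <- randomForest(y ~ ., data = train_df)
```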
Also, you are running in parallel on 4 cores, and randomForest needs all the observations available, so your training observations will be held in memory 4 times over. That probably leaves little memory for actually training the model.
My advice is to start scaling down to see if you can speed things up: set the bootstrap number to 1 and the tuneLength back to the default of 3, or even set the trainControl method to "none", just to get an idea of how fast the model is with minimal settings and no resampling (sketch below).
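A minimal-settings sketch along those lines, with the same hypothetical `train_df`; note that with method = "none", train() requires a single-row tuneGrid:

```r
## No resampling at all: one forest, one fixed mtry value.
fit_min <- train(y ~ ., data = train_df,
                 method = "rf",
                 trControl = trainControl(method = "none"),
                 tuneGrid = data.frame(mtry = 2))
```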
@phiver hits the nail on the head, but for this situation there are a few more things to suggest:
- make sure that you are not exhausting your system memory by using parallel processing. You are making X extra copies of the data in memory when using X workers.
- with a class imbalance, alternative sampling can help: downsampling can improve performance and takes less time.
- use different libraries: ranger instead of randomForest, xgboost or C5.0 instead of gbm. Keep in mind that ensemble methods fit a ton of constituent models and are bound to take a while.
- the package has a racing-type algorithm (adaptive resampling) for tuning parameters in less time.
- the development version on GitHub has random search methods for the models with a lot of tuning parameters. A sketch combining several of these options follows below.
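A sketch combining several of these options in one trainControl call; `x_train`/`y_train` are hypothetical predictors and a two-class factor response, and random search assumes a caret version recent enough to include it:

```r
library(caret)

ctrl <- trainControl(method = "adaptive_cv",        # racing-type adaptive resampling
                     number = 10,
                     adaptive = list(min = 5, alpha = 0.05,
                                     method = "gls", complete = TRUE),
                     sampling = "down",             # downsample the majority class
                     search = "random")             # random search over parameters

## ranger is a much faster reimplementation of random forests.
fit <- train(x = x_train, y = y_train,
             method = "ranger",
             trControl = ctrl,
             tuneLength = 20)
```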
Max
Great inputs by @phiver and @topepo. I will try to summarize them and add a few more points I gathered from searching SO posts about a similar problem:
- Yes, parallel processing takes less time, but more memory. With 8 cores and 64GB RAM, a rule of thumb could be to use 5-6 workers at most (see the second sketch after this list).
- @topepo's page on caret pre-processing here is fantastic. It is instructive step by step and replaces the manual work of pre-processing, such as creating dummy variables, removing multi-collinear/linear-combination variables, and transformations (see the first sketch after this list).
- One of the reasons randomForest and other models become really slow is the number of levels in categorical variables. It is advisable to lump factor levels together, or to convert to an ordinal/numeric encoding where possible.
- Try using the tuneGrid feature in caret to the fullest for the ensemble models. Start with the smallest values of mtry/ntree on a sample of the data and see how much accuracy improves.
- I found this SO page very useful; it primarily suggests parRF. I didn't see a lot of improvement on my dataset from replacing rf with parRF, but you can try it out. The other suggestions there are to use data.table instead of data frames and to pass predictor/response data instead of a formula. That greatly improves the speed, believe me. (One caveat: using the x/y interface (x = X, y = Y) also seems to somewhat change predictive accuracy, and it changes the variable importance table away from the factor-wise breakdown you get with the formula interface (Y ~ .).)
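First sketch, on pre-processing: a hedged outline of the steps from @topepo's page, again assuming a hypothetical `train_df` with response `y`. Factors are expanded to dummy variables, then near-zero-variance columns and exact linear combinations are dropped before training:

```r
library(caret)

## Expand factors to dummy variables.
dv <- dummyVars(y ~ ., data = train_df)
x  <- predict(dv, newdata = train_df)

## Drop near-zero-variance columns.
nzv <- nearZeroVar(x)
if (length(nzv) > 0) x <- x[, -nzv]

## Drop columns that are exact linear combinations of others.
combos <- findLinearCombos(x)
if (!is.null(combos$remove)) x <- x[, -combos$remove]
```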
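Second sketch, on workers, a small tuning grid, and the x/y interface; the worker count and grid values are illustrative, and `x` is the dummy-coded matrix from the previous sketch:

```r
library(caret)
library(doParallel)

## Leave headroom: on an 8-core / 64GB machine, register 5 workers
## rather than 8, since each worker holds its own copy of the data.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)

## Small explicit grid: start with low mtry values and grow only if
## accuracy keeps improving.
grid <- expand.grid(mtry = c(2, 4, 8))

## x/y interface instead of Y ~ . : skips the formula/model.matrix
## machinery, which is slow on wide data.
fit <- train(x = x, y = train_df$y,
             method = "rf",
             trControl = trainControl(method = "cv", number = 5),
             tuneGrid = grid,
             ntree = 300)   # passed through to randomForest

stopCluster(cl)
```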