How to use random forests in R with missing values?
My initial reaction to this question was that it didn't show much research effort, since "everyone" knows that random forests don't handle missing values in predictors. But upon checking ?randomForest, I must confess that it could be much more explicit about this.
(Although, Breiman's PDF linked to in the documentation does explicitly say that missing values are simply not handled at all.)
The only obvious clue in the official documentation that I could see was that the default value for the na.action parameter is na.fail, which might be too cryptic for new users.
In any case, if your predictors have missing values, you have (basically) two choices:
- Use a different tool (rpart handles missing values nicely).
- Impute the missing values
Not surprisingly, the randomForest package has a function for doing just this, rfImpute. The documentation at ?rfImpute runs through a basic example of its use.
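For reference, here is a minimal sketch along the lines of that documented example. It assumes the built-in iris data with some values knocked out artificially; the object names and seeds are only for illustration.

```r
library(randomForest)

# Start from a complete data set and blank out some predictor values.
iris.na <- iris
set.seed(111)
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA

# Impute the missing predictor values using random forest proximities.
# The response (Species) must not contain NAs.
set.seed(222)
iris.imputed <- rfImpute(Species ~ ., iris.na)

# Fit the forest on the completed data.
set.seed(333)
iris.rf <- randomForest(Species ~ ., iris.imputed)
print(iris.rf)
```

Note that rfImpute needs a complete response, so it helps with missing predictors only.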
If only a small number of cases have missing values, you might also try setting na.action = na.omit to simply drop those cases.
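A small sketch of the difference between the default and na.omit, again assuming iris with a few artificially blanked values (the names iris.na, res and rf are just for illustration):

```r
library(randomForest)

# Blank out 10 values in one predictor.
iris.na <- iris
set.seed(111)
iris.na[sample(150, 10), "Sepal.Length"] <- NA

# With the default na.action = na.fail, the call stops with an error.
res <- try(randomForest(Species ~ ., data = iris.na), silent = TRUE)
inherits(res, "try-error")

# With na.omit, the incomplete rows are dropped before the forest is grown.
rf <- randomForest(Species ~ ., data = iris.na, na.action = na.omit)
sum(complete.cases(iris.na))  # number of rows actually used
```

This is only sensible when the dropped rows are few and not systematically different from the rest.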
And of course, this answer is a bit of a guess that your problem really is simply having missing values.
Breiman's random forest, which the randomForest package is based on, actually does handle missing values in predictors. In the randomForest package, you can set
na.action = na.roughfix
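On its own, na.roughfix only replaces missing values with the column median (numeric predictors) or the most frequent level (factors). The iterative version is rfImpute: it starts from that rough fix, grows a forest, computes proximities, uses them to update the imputed values, and then repeats with the newly filled values. This is not well explained in the randomForest documentation (p10). It only states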
....NAs are replaced with column medians .... This is used as a starting point for imputing missing values by random forest
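To see what the rough fix does by itself, here is a minimal sketch, assuming iris with some artificially blanked values (iris.na, iris.fixed and the seed are illustrative choices, not anything from the package):

```r
library(randomForest)

# Blank out 10 values in one numeric predictor.
iris.na <- iris
set.seed(111)
iris.na[sample(150, 10), "Sepal.Length"] <- NA
missing <- is.na(iris.na$Sepal.Length)

# Applied directly to a data frame: numeric NAs become the column median.
iris.fixed <- na.roughfix(iris.na)
med <- median(iris.na$Sepal.Length, na.rm = TRUE)
all(iris.fixed$Sepal.Length[missing] == med)

# Or passed as na.action, so the fix is applied when fitting.
rf <- randomForest(Species ~ ., data = iris.na, na.action = na.roughfix)
```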
On Breiman's homepage you find a little bit more information
missfill= 1,2 does a fast replacement of the missing values, for the training set (if equal to 1) and a more careful replacement (if equal to 2).
mfixrep= k with missfill=2 does a slower, but usually more effective, replacement using proximities with k iterations on the training set only. (Requires nprox >0).