Missing values in scikit-learn

Missing values are simply not supported in scikit-learn. There has been discussion on the mailing list about this before, but no attempt to actually write code to handle them.

Whatever you do, don't use NaN to encode missing values, since many of the algorithms refuse to handle samples containing NaNs.
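
For example, most estimators validate their input and reject NaNs outright before doing any work. A minimal sketch (the exact error message varies by version):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0],
              [np.nan, 3.0]])
y = np.array([0.0, 1.0])

# Fails input validation with something like:
#   ValueError: Input contains NaN
LinearRegression().fit(X, y)
```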

The above answer is outdated; recent releases of scikit-learn include an Imputer class that does simple, per-feature missing-value imputation. You can feed it arrays containing NaNs and have them replaced by the mean, median, or most frequent value (mode) of the corresponding feature.
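
A minimal sketch of per-feature mean imputation (note that in current releases the class has been renamed SimpleImputer and moved to sklearn.impute; older versions used sklearn.preprocessing.Imputer):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Replace each NaN with the mean of its column; strategy can also be
# "median" or "most_frequent".
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X)
# X_imputed[1, 0] is now 4.0, the mean of column 0's observed values.
```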

I wish I could provide a simple example, but I have found that RandomForestRegressor does not handle NaNs gracefully. Performance gets steadily worse as you add features with increasing percentages of NaNs. Features that have "too many" NaNs are ignored entirely, even when the NaNs carry very useful information.

This is because the algorithm never creates a split on the decision "isnan" or "ismissing". The algorithm ignores a feature at a given node of the tree if even a single NaN appears in that node's subset of samples. But at deeper levels of the tree, where the subsets are smaller, it becomes increasingly likely that a subset contains no NaN for a particular feature, and a split can then occur on that feature (see the quick calculation below).
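
To see why depth matters, here is a back-of-the-envelope calculation (the independence assumption and the 5% missing rate are mine, for illustration):

```python
# Chance that a node's subset of n samples contains no NaN in a feature,
# assuming each sample is missing that feature independently with rate p.
p = 0.05
for n in (1000, 100, 10):
    print(n, (1 - p) ** n)
# 1000 -> ~5e-23, 100 -> ~0.006, 10 -> ~0.60
```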

I have tried various imputation techniques to deal with the problem (replacing with the mean/median, predicting the missing values with a separate model, etc.), but the results were mixed.
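
For the model-based route, scikit-learn later added an experimental IterativeImputer that predicts each incomplete feature from the others (it postdates this answer; a minimal sketch):

```python
import numpy as np
# IterativeImputer is experimental; this enabling import is required.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# Each feature with missing entries is regressed on the other features,
# and the model's predictions fill in the gaps.
X_filled = IterativeImputer(random_state=0).fit_transform(X)
```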

Instead, this is my solution: replace the NaNs with a single, obviously out-of-range value (like -1.0). This lets the tree split on the criterion "unknown value vs. known value". However, using such out-of-range values has a strange side effect: known values near the out-of-range value can get lumped together with it when the algorithm searches for a good split point. For example, known 0s could get lumped with the -1s used to replace the NaNs. So your model can change depending on whether your out-of-range value is below the minimum or above the maximum (it gets lumped in with the minimum or maximum value, respectively). This may or may not help the technique generalize; the outcome depends on how similar in behavior the minimum- or maximum-value samples are to the NaN-value samples.
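
A minimal sketch of that replacement on synthetic data (the feature ranges, the 20% missing rate, and the -1.0 sentinel are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)  # all known values lie in [0, 1]
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out 20% of the third feature to simulate missing data.
missing = rng.rand(200) < 0.2
X[missing, 2] = np.nan

# Replace NaNs with a sentinel below the observed range, so the trees
# can learn an "unknown vs. known" split on their own.
X_filled = np.where(np.isnan(X), -1.0, X)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_filled, y)
```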