Classifiers in scikit-learn that handle NaN/null
I made an example that contains missing values in both the training and the test sets. I picked a strategy that replaces missing data with the column mean, using the SimpleImputer class; there are other strategies.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

X_train = [[0, 0, np.nan], [np.nan, 1, 1]]
Y_train = [0, 1]
# transform() expects 2D input, so each test sample is wrapped in its own list
X_test_1 = [[0, 0, np.nan]]
X_test_2 = [[0, np.nan, np.nan]]
X_test_3 = [[np.nan, 1, 1]]

# Create our imputer to replace missing values with the mean of each column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(X_train)

# Impute our data, then train
X_train_imp = imp.transform(X_train)
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X_train_imp, Y_train)

for X_test in [X_test_1, X_test_2, X_test_3]:
    # Impute each test item, then predict
    X_test_imp = imp.transform(X_test)
    print(X_test[0], '->', clf.predict(X_test_imp))
```
Output:

```
[0, 0, nan] -> [0]
[0, nan, nan] -> [0]
[nan, 1, 1] -> [1]
```
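Besides 'mean', SimpleImputer also accepts 'median', 'most_frequent', and 'constant' strategies. A quick sketch with a made-up toy matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = [[1, 2], [np.nan, 4], [7, 6]]  # made-up data; column 0 has one NaN

# Replace each NaN with the median of its column (median of 1 and 7 is 4)
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
print(imp_median.fit_transform(X))  # the NaN becomes 4.0

# Or replace with a fixed value via strategy='constant'
imp_const = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
print(imp_const.fit_transform(X))  # the NaN becomes 0.0
```

Which strategy is right depends on the feature: 'most_frequent' also works for categorical columns, while 'mean' and 'median' are numeric-only.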
If you are using a pandas DataFrame, you can use fillna. Here I replaced the missing data with the mean of that column:

```python
df.fillna(df.mean(), inplace=True)
```
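A minimal self-contained sketch with a made-up two-column frame, showing the per-column means being filled in:

```python
import numpy as np
import pandas as pd

# Made-up data: one NaN in each column
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 2.0, 4.0]})

# Each NaN is replaced by the mean of its own column:
# column 'a' mean is 2.0, column 'b' mean is 3.0
df.fillna(df.mean(), inplace=True)
print(df)
```

Note that fillna aligns on column labels, so each column gets its own mean rather than a single global value.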
Short answer
Sometimes missing values are simply not applicable, and imputing them is meaningless. In these cases you should use a model that can handle missing values. Most of scikit-learn's models cannot handle missing values; XGBoost can.
More on scikit-learn and XGBoost
As mentioned in this article, scikit-learn's decision trees and KNN algorithms are not (yet) robust enough to work with missing values. If imputation doesn't make sense, don't do it.
Consider situations where imputation doesn't make sense (keep in mind this is a made-up example).
Consider a dataset with rows of cars ("Danho Diesel", "Estal Electric", "Hesproc Hybrid") and columns with their properties (Weight, Top speed, Acceleration, Power output, Sulfur Dioxide Emission, Range).
Electric cars do not produce exhaust fumes, so the Sulfur dioxide emission of the Estal Electric should be a NaN value (missing). You could argue that it should be set to 0, but electric cars cannot produce sulfur dioxide. Imputing the value will ruin your predictions.
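To see how this goes wrong, run mean imputation over a made-up emissions column (all figures invented for illustration): the electric car gets credited with the average of the diesel and hybrid emissions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up SO2 emission figures for Danho Diesel, Estal Electric, Hesproc Hybrid;
# NaN marks the electric car, for which the quantity is simply not applicable
so2 = [[40.0], [np.nan], [25.0]]

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imp.fit_transform(so2))
# The electric car is now reported as emitting 32.5 units of SO2,
# which is physically impossible and will mislead any downstream model.
```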