How to handle missing NaNs for machine learning in Python

What is the best way to handle missing values in a data set?

There is NO best way; each solution/algorithm has its own pros and cons. You can even mix several of them into your own strategy and tune the related parameters to find what best satisfies your data; there is a lot of research on this topic.

For example, Mean Imputation is quick and simple, but it underestimates the variance and distorts the distribution shape by replacing every NaN with the mean value. KNN Imputation, meanwhile, might not be ideal on a large data set in terms of time complexity, since it iterates over all the data points and performs a calculation for each NaN value, and it assumes that the attribute with missing values is correlated with the other attributes.
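
To see that variance shrinkage concretely, here is a quick NumPy sketch (the toy array is made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])

observed = x[~np.isnan(x)]
filled = np.where(np.isnan(x), observed.mean(), x)

print(observed.var())  # variance of the observed values: 4.56
print(filled.var())    # ~3.26; mean-filling pulls values toward the center
```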

How to handle missing values in datasets before applying a machine learning algorithm?

In addition to the mean imputation you mention, you could also take a look at K-Nearest Neighbor Imputation and Regression Imputation, and refer to scikit-learn's Imputer class for existing APIs to use.
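
For instance, a minimal mean-imputation sketch with scikit-learn (note that the old Imputer class was removed in scikit-learn 0.22 in favor of SimpleImputer; the toy matrix is made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))
```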

KNN Imputation

Replace each NaN with the mean of the k nearest neighbors of that point.
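
A minimal sketch using scikit-learn's KNNImputer (added in scikit-learn 0.22; the toy data is made up):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each NaN is filled with the mean of that feature across
# the 2 nearest neighbors (distances ignore missing entries).
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```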

Regression Imputation

A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing.
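
One way to sketch regression imputation in scikit-learn is IterativeImputer, which regresses each feature with missing values on the others (still flagged experimental, hence the extra enabling import; the toy data is made up):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Fits a regressor per feature on the observed values, then
# predicts the missing entries, iterating until convergence.
imputer = IterativeImputer(random_state=0)
print(imputer.fit_transform(X))
```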

Here is a link to scikit-learn's 'Imputation of missing values' section. I have also heard of the Orange library for imputation, but haven't had a chance to use it yet.


There's no single best way to deal with missing data. The most rigorous approach is to model the missing values as additional parameters in a probabilistic framework like PyMC. This way you'll get a distribution over possible values, instead of just a single answer. Here's an example of dealing with missing data using PyMC: http://stronginference.com/missing-data-imputation.html
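
A minimal sketch of that idea with PyMC3 (passing a NumPy masked array as observed makes PyMC treat the masked entries as latent variables to sample; the model and data here are made up, and newer PyMC versions differ slightly in API):

```python
import numpy as np
import pymc3 as pm

data = np.ma.masked_invalid([1.0, 2.0, np.nan, 4.0, 5.0, np.nan])

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=10.0)
    # Masked entries become latent variables ("x_missing"),
    # so the trace holds a posterior over the imputed values.
    pm.Normal("x", mu=mu, sigma=sigma, observed=data)
    trace = pm.sample(1000, tune=1000)

print(trace["x_missing"].mean(axis=0))  # posterior means of the imputations
```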

If you really want to plug those holes with point estimates, then you're looking to perform "imputation". I'd steer away from simple imputation methods like mean-filling, since they really butcher the joint distribution of your features. Instead, try something like softImpute (which tries to infer the missing values via a low-rank approximation). The original version of softImpute is written for R, but I've made a Python version (along with other methods like kNN imputation) here: https://github.com/hammerlab/fancyimpute
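
For what it's worth, a minimal sketch of the fancyimpute API (recent versions use fit_transform; older releases used complete() instead; the toy matrix is made up):

```python
import numpy as np
from fancyimpute import SoftImpute, KNN

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Low-rank completion via iteratively soft-thresholded SVD.
X_soft = SoftImpute().fit_transform(X)

# kNN-based alternative from the same package.
X_knn = KNN(k=2).fit_transform(X)
```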