Python - SkLearn Imputer usage

Since you say you want to replace these 'na' values with the mean of the column, I'm guessing the non-missing values are indeed floats. The problem is that pandas does not recognize the string 'na' as a missing value, so it reads the column with dtype object instead of some flavor of float.

Case in point, consider the following .csv file:

 test.csv

 col1,col2
 1.0,1.0
 2.0,2.0
 3.0,3.0
 na,4.0
 5.0,5.0

With the naive import df = pd.read_csv('test.csv'), df.dtypes tells us that col1 is of dtype object and col2 is of dtype float64. But how do you take the mean of a bunch of objects?
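To see the problem for yourself, here is a minimal sketch that reads the same data from an in-memory string instead of a file. The `CSV_TEXT` and `naive` names are just for illustration. Depending on your pandas version, col1 may be reported as object or as a dedicated string dtype; either way it is not numeric.

```python
import io

import pandas as pd

# Same contents as test.csv, kept in memory so the sketch is self-contained
# (CSV_TEXT is an illustrative name, not part of the original question).
CSV_TEXT = "col1,col2\n1.0,1.0\n2.0,2.0\n3.0,3.0\nna,4.0\n5.0,5.0\n"

# Naive import: 'na' is not treated as a missing-value marker here.
naive = pd.read_csv(io.StringIO(CSV_TEXT))

# col1 is read as strings (object or string dtype), col2 as float64.
print(naive.dtypes)

# The values in col1 are plain strings, including the numbers.
print(repr(naive["col1"].iloc[0]), repr(naive["col1"].iloc[3]))
```

Because every entry in col1 is a string, there is nothing numeric for an imputer to average.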

The solution is to tell pd.read_csv() to interpret the string 'na' as a missing value:

df = pd.read_csv('test.csv', na_values='na')

The resulting dataframe has both columns of dtype float64, and you can now use your imputer.
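For completeness, here is a rough sketch of the whole flow. It assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer` (the older `Imputer` class from `sklearn.preprocessing` was removed in later releases), and it reads the data from an in-memory string rather than from test.csv. The names `CSV_TEXT` and `filled` are illustrative.

```python
import io

import pandas as pd
from sklearn.impute import SimpleImputer

# Same contents as test.csv, kept in memory so the sketch is self-contained.
CSV_TEXT = "col1,col2\n1.0,1.0\n2.0,2.0\n3.0,3.0\nna,4.0\n5.0,5.0\n"

# Tell pandas that the string 'na' marks a missing value.
df = pd.read_csv(io.StringIO(CSV_TEXT), na_values="na")
print(df.dtypes)  # both columns are float64 now

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

The missing entry in col1 is filled with the mean of the four remaining values, (1 + 2 + 3 + 5) / 4 = 2.75.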
