Python - SkLearn Imputer usage
Since you say you want to replace these 'na' entries with the mean of the column, I'm guessing the non-missing values are indeed floats. The problem is that pandas does not recognize the string 'na' as a missing value, and so reads the column with dtype object instead of some flavor of float.
Case in point, consider the following .csv file:
test.csv
col1,col2
1.0,1.0
2.0,2.0
3.0,3.0
na,4.0
5.0,5.0
With the naive import df = pd.read_csv('test.csv'), df.dtypes tells us that col1 is of dtype object and col2 is of dtype float64. But how do you take the mean of a bunch of objects?
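To see this for yourself without a file on disk, here is a minimal sketch that feeds the same contents to pd.read_csv() through io.StringIO. The names CSV_TEXT and naive are mine, purely for illustration, and the exact dtype label printed for col1 may depend on your pandas version.

```python
import io

import pandas as pd

# Same contents as test.csv, inlined so the sketch is self-contained.
# CSV_TEXT and naive are illustrative names, not part of any API.
CSV_TEXT = "col1,col2\n1.0,1.0\n2.0,2.0\n3.0,3.0\nna,4.0\n5.0,5.0\n"

# Naive import: 'na' is not among the strings pandas treats as missing by default.
naive = pd.read_csv(io.StringIO(CSV_TEXT))
print(naive.dtypes)

# Every entry of col1 is a string, including the ones that look like numbers.
print(naive["col1"].tolist())

# col1 is not a float column; col2 is.
col1_is_float = pd.api.types.is_float_dtype(naive["col1"])
col2_is_float = pd.api.types.is_float_dtype(naive["col2"])
print(col1_is_float, col2_is_float)
```

Because col1 holds strings, there is nothing numeric to average until the 'na' entry is parsed as missing.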
The solution is to tell pd.read_csv() to interpret the string 'na' as a missing value:
df = pd.read_csv('test.csv', na_values='na')
The resulting dataframe has both columns of dtype float64, and you can now use your imputer.
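For completeness, here is a rough sketch of the whole round trip. I'm assuming a reasonably recent scikit-learn, where the mean imputer is SimpleImputer in sklearn.impute (the older sklearn.preprocessing.Imputer has been removed); adapt it if your version differs. The CSV is inlined via io.StringIO, and CSV_TEXT, imputer and filled are just illustrative names.

```python
import io

import pandas as pd
from sklearn.impute import SimpleImputer  # assumes a recent scikit-learn

# Same contents as test.csv, inlined so the sketch is self-contained.
CSV_TEXT = "col1,col2\n1.0,1.0\n2.0,2.0\n3.0,3.0\nna,4.0\n5.0,5.0\n"

# Tell pandas that the string 'na' marks a missing value.
df = pd.read_csv(io.StringIO(CSV_TEXT), na_values="na")
print(df.dtypes)  # both columns are now float64

# Replace each missing entry with the mean of its column.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# The gap in col1 becomes the mean of 1, 2, 3 and 5, i.e. 2.75.
print(filled)
```

fit_transform returns a plain NumPy array, which is why the result is wrapped back into a DataFrame with the original column names.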