NumPy or Pandas: Keeping array type as integer while having a NaN value
NaN
can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
(This feature has been added beginning with version 0.24 of pandas, but note it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lower case): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support )
This is not a solution for all cases, but mine (genomic coordinates) I've resorted to using 0 as NaN
a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)
This at least allows for the proper 'native' column type to be used, operations like subtraction, comparison etc work as expected
If performance is not the main issue, you can store strings instead.
df.col = df.col.dropna().apply(lambda x: str(int(x)) )
Then you can mix then with NaN
as much as you want. If you really want to have integers, depending on your application, you can use -1
, or 0
, or 1234567890
, or some other dedicated value to represent NaN
.
You can also temporarily duplicate the columns: one as you have, with floats; the other one experimental, with ints or strings. Then inserts asserts
in every reasonable place checking that the two are in sync. After enough testing you can let go of the floats.
This capability has been added to pandas (beginning with version 0.24): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support
At this point, it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lowercase).