What is the difference between NaN and None?

NaN is used consistently as the placeholder for missing data in pandas, and that consistency is good. I usually read NaN as "missing". See also the 'working with missing data' section in the docs.

Wes writes in the docs 'choice of NA-representation':

After years of production use [NaN] has proven, at least in my opinion, to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions isnull and notnull which can be used across the dtypes to detect NA values.
...
Thus, I have chosen the Pythonic “practicality beats purity” approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.

Note the "gotcha": an integer Series containing missing data is upcast to float.
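The upcast is easy to reproduce. The following is a minimal sketch (the names `ints` and `with_gap` are only illustrative) that introduces a missing value by reindexing onto a label that does not exist:

```python
import pandas as pd

# An all-integer Series keeps an integer dtype.
ints = pd.Series([1, 2, 3])

# Label 3 does not exist, so reindexing introduces a missing value and
# the integers are promoted to float64 to make room for NaN.
with_gap = ints.reindex([0, 1, 2, 3])

print(ints.dtype)                # an integer dtype
print(with_gap.dtype)            # float64
print(with_gap.isna().tolist())  # [False, False, False, True]
```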

In my opinion the main reason to use NaN (over None) is that it can be stored with numpy's float64 dtype, rather than the less efficient object dtype; see NA type promotions.

import numpy as np
import pandas as pd

# Without forcing dtype=object, pandas converts None to NaN!
s_bad = pd.Series([1, None], dtype=object)
s_good = pd.Series([1, np.nan])

In [13]: s_bad.dtype
Out[13]: dtype('O')

In [14]: s_good.dtype
Out[14]: dtype('float64')

Jeff comments (below) on this:

np.nan allows for vectorized operations; it's a float value, while None, by definition, forces object type, which basically disables all efficiency in numpy.

So repeat 3 times fast: object==bad, float==good

That said, many operations may still work just as well with None as with NaN (but they are perhaps not supported, i.e. they may sometimes give surprising results):

In [15]: s_bad.sum()
Out[15]: 1

In [16]: s_good.sum()
Out[16]: 1.0

To answer the second question:
You should be using pd.isnull and pd.notnull to test for missing data (NaN).
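As a small sketch of that advice, reusing the `s_bad` and `s_good` Series from above (the other variable names are only illustrative), both functions treat None and NaN alike:

```python
import numpy as np
import pandas as pd

s_bad = pd.Series([1, None], dtype=object)  # None kept as-is, object dtype
s_good = pd.Series([1, np.nan])             # NaN, float64 dtype

# Both representations are reported as missing.
bad_missing = pd.isnull(s_bad).tolist()     # [False, True]
good_missing = pd.isnull(s_good).tolist()   # [False, True]
bad_present = pd.notnull(s_bad).tolist()    # [True, False]

# The same functions also accept scalars.
scalar_checks = (pd.isnull(None), pd.isnull(np.nan), pd.isnull(0))
print(bad_missing, good_missing, bad_present, scalar_checks)
```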

NaN can be used as a numerical value in mathematical operations, while None cannot (or at least shouldn't).

NaN is a numeric value, as defined in the IEEE 754 floating-point standard. None is the sole instance of a built-in Python type (NoneType) and, in this context, means something more like "nonexistent" or "empty" than "numerically invalid".

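The main "symptom" of that is that if you compute, say, an average or a sum over a numpy array containing NaN, even a single one, you get NaN as a result. (pandas reductions such as Series.sum skip NaN by default, which is why s_good.sum() above returns 1.0.)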
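A minimal sketch of this propagation with a plain numpy array (the variable names are illustrative):

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan])

total = values.sum()     # nan: one NaN contaminates the whole sum
average = values.mean()  # nan, for the same reason

# NaN is an ordinary float, and it never compares equal to anything,
# not even to itself.
nan_is_float = isinstance(np.nan, float)
nan_equals_itself = np.nan == np.nan

print(total, average, nan_is_float, nan_equals_itself)
```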

On the other hand, you cannot perform mathematical operations with None as an operand; Python raises a TypeError.

So, depending on the case, you could use None as a way to tell your algorithm not to consider invalid or nonexistent values in computations. That would mean the algorithm should test each value to see if it is None.
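One way such an algorithm might look is sketched below. The helper `mean_ignoring_none` is hypothetical, written only for this illustration; it is not part of numpy or pandas.

```python
def mean_ignoring_none(values):
    """Average the entries that are not None; return None if none remain."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return sum(present) / len(present)


readings = [2.0, None, 4.0]
result = mean_ignoring_none(readings)  # 3.0

# Without the explicit test, arithmetic on None fails outright.
try:
    sum(readings)
    raised = False
except TypeError:
    raised = True

print(result, raised)
```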

Numpy has some functions that keep NaN values from contaminating your results, such as nansum and nan_to_num.
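A short sketch of those functions, with `np.nanmean` added for comparison (the variable names are illustrative):

```python
import numpy as np

values = np.array([1.0, np.nan, 2.0])

plain_sum = np.sum(values)      # nan
safe_sum = np.nansum(values)    # 3.0: NaN is treated as zero
safe_mean = np.nanmean(values)  # 1.5: NaN is left out of the count
filled = np.nan_to_num(values)  # NaN replaced by 0.0 by default

print(plain_sum, safe_sum, safe_mean, filled)
```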

The function isnan() checks whether a value is "Not A Number": it returns True if the value is NaN and False otherwise. For example, isnan(2) returns False.

The conditional myVar is not None tests whether the variable holds a value other than None. (A name that was never assigned raises a NameError instead.)
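The two checks answer different questions, as this small sketch using the standard library's `math.isnan` suggests (the variable names are illustrative):

```python
import math

my_var = None
other = float("nan")

my_var_is_set = my_var is not None  # False: the variable holds None
other_is_set = other is not None    # True: NaN is a real float object
other_is_nan = math.isnan(other)    # True
two_is_nan = math.isnan(2)          # False: 2 is a number

print(my_var_is_set, other_is_set, other_is_nan, two_is_nan)
```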

Your numpy array uses isnan() because it is intended to be an array of numbers, and it initializes all of its elements to NaN; these elements are considered "empty".
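A minimal sketch of that pattern, assuming the array is created with `np.full` (the original code is not shown, so the construction and names here are illustrative):

```python
import numpy as np

slots = np.full(4, np.nan)  # every element starts as NaN, i.e. "empty"
slots[1] = 10.0
slots[3] = 2.5

empty_mask = np.isnan(slots)        # True where nothing has been stored yet
filled_values = slots[~empty_mask]  # only the elements that were set

print(empty_mask.tolist(), filled_values.tolist())
```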
