Usage and convention differences between missing, nothing, undef, and NaN in Julia

TLDR:

  • If you're working in statistics, chances are that you want missing to signal the absence of a particular data point in a collection.

  • If you want to define an array of floating-point numbers, but initialize individual elements later, you might want to use undef for performance reasons (to avoid spending time setting elements to a value that will get overwritten afterwards):

    Vector{Float64}(undef, n)
    

    In the same situation, but following an approach less oriented towards performance and more towards safety, you can also initialize all elements to NaN in order to take advantage of the propagating behavior of NaN to help identify bugs that could happen if you forget to set some value in the array:

    fill(NaN, n)
    
  • You'll probably encounter nothing in some parts of Julia's API, where it signals cases in which no meaningful value can be computed. But it is generally not used in arrays that otherwise contain numeric data (which seems to be your use case here).


Here is my take on the differences between these options:




missing is used to represent missing values in a statistical sense, i.e. values that theoretically exist, but that you don't know. missing is similar in spirit (and in behavior, in most cases) to NA in R. A defining feature of missing values is that you can use them in computations:

julia> x = 1       # x has a known value: 1
1

julia> y = missing # y has a value, but it is unknown
missing

julia> z = x * y   # no error: z has a value, that just happens to be unknown
missing            # (as a consequence of not knowing the value of y)
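
As a rough sketch of how this plays out in practice, here are a few common ways of dealing with missing values. The vector v and the variable names are made up for illustration; the functions themselves (ismissing, skipmissing, coalesce) are part of Base:

```julia
# Made-up example data: one unknown value among known ones.
v = [1, missing, 3]

# Arithmetic and comparisons propagate missing...
unknown_sum = 1 + missing          # missing
unknown_eq  = missing == missing   # missing (not true!)

# ...so use ismissing to test for it explicitly.
second_is_missing = ismissing(v[2])   # true

# Skip missing values when aggregating.
total = sum(skipmissing(v))           # 4

# Or replace them with a default value, element by element.
filled = coalesce.(v, 0)              # [1, 0, 3]
```

Note that == returns missing rather than true or false here, which is why ismissing (or isequal) is the right tool for tests.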

One important characteristic of missing is that it has its own specific type: Missing. This means in particular that arrays containing missing values among other numeric values are not homogeneous in type:

julia> [1, missing, 3]
3-element Array{Union{Missing, Int64},1}: # not Array{Int64, 1}
 1
 missing
 3

Note that, although the Julia compiler has become very good at handling arrays whose element type is such a small union, there is an inherent performance cost to having elements of different types, since the type of a given element cannot be known in advance.




nothing also has its own type: Nothing. In contrast to missing, it tends to be used for things that have no value. This is why computing with nothing does not make sense and raises an error:

julia> 3*nothing
ERROR: MethodError: no method matching *(::Int64, ::Nothing)

nothing is primarily used as the return value of functions that don't return anything, either because they only have side-effects, or because they could not compute any meaningful result:

julia> @show println("OK")           # Only side effects
OK
println("OK") = nothing

julia> @show findfirst('a', "Hello") # No meaningful result
findfirst('a', "Hello") = nothing
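
Since such functions may return nothing, callers usually check for it before using the result. Here is a minimal sketch of that idiom; position_or_zero is a hypothetical helper written for this example, while something is a Base function that returns its first argument that is not nothing:

```julia
# Hypothetical helper: position of character c in s, or 0 if it is absent.
function position_or_zero(c::Char, s::AbstractString)
    idx = findfirst(c, s)               # either an Int or nothing
    return idx === nothing ? 0 : idx    # handle the "no result" case explicitly
end

found     = position_or_zero('l', "Hello")   # 3
not_found = position_or_zero('a', "Hello")   # 0

# The same fallback, written with something:
fallback = something(findfirst('a', "Hello"), 0)   # 0
```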

Another notable use of nothing is in function arguments or object fields for which a value is not always provided. This would typically be represented in the type system as a Union{MeaningfulType, Nothing}. For example, with the following definition of a binary tree structure, a leaf (which, by definition, is a node that has no children) would be represented as a node whose children are nothing:

struct TreeNode
  child1 :: Union{TreeNode, Nothing}
  child2 :: Union{TreeNode, Nothing}
end

leaf = TreeNode(nothing, nothing)
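
To illustrate how code can work with such a Union field, here is a small sketch that counts the leaves of a tree. isleaf and count_leaves are hypothetical helpers that I made up for this example; dispatching on ::Nothing is just one possible way to handle absent children:

```julia
struct TreeNode
    child1::Union{TreeNode, Nothing}
    child2::Union{TreeNode, Nothing}
end

# Hypothetical helper: a leaf is a node without any children.
isleaf(node::TreeNode) = node.child1 === nothing && node.child2 === nothing

# Hypothetical helper: an absent child contributes no leaves.
count_leaves(::Nothing) = 0
count_leaves(node::TreeNode) =
    isleaf(node) ? 1 : count_leaves(node.child1) + count_leaves(node.child2)

leaf = TreeNode(nothing, nothing)
root = TreeNode(TreeNode(leaf, nothing), leaf)

n_leaves = count_leaves(root)   # 2
```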




Unlike the previous two, NaN does not have its own specific type: NaN is merely a specific value of the Float64 type (and NaN32 similarly exists for Float32). As you probably know, these values normally appear as the result of undefined operations (such as 0/0), and have a very special meaning in floating-point arithmetic, which makes them propagate (in more or less the same way as missing values). But apart from that arithmetic behavior, these are normal floating-point values. In particular, a vector of floating-point values may contain NaNs without it affecting its type:

julia> [1., NaN, 2.]
3-element Array{Float64,1}: # Note how this differs from the example with missing above
 1.0
 NaN
 2.0
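
A short sketch of that arithmetic behavior may help (variable names are made up for illustration). Note one difference with missing: comparing NaN with == yields false rather than propagating:

```julia
x = 0 / 0                     # NaN, the result of an undefined operation

x_type      = typeof(x)       # Float64: NaN is an ordinary floating-point value
propagated  = x + 1           # NaN: arithmetic propagates it
self_equal  = (x == x)        # false: NaN is not equal to itself
detected    = isnan(x)        # true: the reliable way to test for NaN
same_by_isequal = isequal(x, x)   # true

v = [1.0, NaN, 2.0]
v_sum   = sum(v)              # NaN: a single NaN contaminates the whole sum
has_nan = any(isnan, v)       # true
```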




undef is very different from everything that has been mentioned so far. It is not really a value (at least not in the sense of a number having a value), but rather a "flag" that one can pass to array constructors to tell Julia not to initialize the values in the array (generally for performance considerations). In the following example, the array elements will not be set to any specific value but, since there is no such thing as a number without any value in Julia, elements will have arbitrary values (coming from whatever happens to be in memory where the vector gets allocated).

julia> Vector{Float64}(undef, 3)
3-element Array{Float64,1}:
 6.94567437726575e-310
 6.94569509953624e-310
 6.94567437549977e-310
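
The typical safe pattern is to allocate with undef and then immediately overwrite every element before anything reads the array. A minimal sketch, where squares is a hypothetical function made up for this example:

```julia
# Hypothetical example: fill a freshly allocated vector with i^2.
function squares(n::Integer)
    out = Vector{Float64}(undef, n)   # contents are arbitrary at this point
    for i in 1:n
        out[i] = i^2                  # every element is written before being read
    end
    return out
end

result = squares(3)   # [1.0, 4.0, 9.0]
```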

When elements are of a more complex type (in technical terms: a non-isbits type) and a distinction can be made between initialized and uninitialized elements, Julia denotes the latter with #undef:

julia> mutable struct Foo end
julia> Vector{Foo}(undef, 3)
3-element Array{Foo,1}:
 #undef
 #undef
 #undef
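
In that case, you can ask whether a given slot has been initialized with isassigned, and reading an uninitialized slot throws an error instead of returning garbage. A small sketch (variable names are made up for illustration):

```julia
mutable struct Foo end

v = Vector{Foo}(undef, 3)

before = isassigned(v, 1)   # false: slot 1 is still #undef
v[1] = Foo()
after  = isassigned(v, 1)   # true
other  = isassigned(v, 2)   # false: slot 2 was never set

# Reading an unassigned slot throws an UndefRefError.
err = try
    v[2]
catch e
    e
end
threw_undef = err isa UndefRefError   # true
```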

I would summarize the options as follows. What I write is from the perspective of "reading values", but this is also guidance when "writing values".

  1. nothing means "the value does not exist" (e.g. findfirst returns nothing when it does not find a value in a collection); it has its own type, Nothing.
  2. missing means "the value exists, but we do not know it". I would expect that you normally get missing in your data only if it comes from an external source: e.g. in a record of patient data, the body temperature may be missing (it obviously exists, it simply was not recorded). I do not think that any function from Base returns missing unless it received missing as an argument. It has its own type, Missing.
  3. NaN is just numeric data (as opposed to nothing and missing); it signals to the user that some operation on numeric values had an undefined result (e.g. 0/0). In my experience, this is the only case in which NaN should appear in your data.
  4. undef is not a value you will ever see; it is only used in forms such as Vector{Int}(undef, 10) to create an array without initializing its values (so it is purely a performance optimization). You should use it only if you intend to immediately fill the array with values you plan to compute. Using undef leads to #undef entries if the element type of the array is not a bits type or a union of bits types; for bits types, it just gives you garbage values.
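
The type-level side of this summary can be checked directly. This is only a quick sketch to make the distinctions concrete:

```julia
typeof(nothing)    # Nothing
typeof(missing)    # Missing
typeof(NaN)        # Float64: an ordinary floating-point value
typeof(undef)      # UndefInitializer: only meaningful as a constructor argument

NaN isa Float64       # true
missing isa Float64   # false
nothing isa Float64   # false
```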

Those are the standard rules. Now, one exception (and this is typical practice in some other languages) is that you might sometimes want to use NaN to stand in for missing or nothing in a collection. It is not something that is recommended, but it has one benefit, which you can see in this example:

julia> x1 = [1.0, NaN]
2-element Array{Float64,1}:
   1.0
 NaN

julia> x2 = [1.0, missing]
2-element Array{Union{Missing, Float64},1}:
 1.0
  missing

As you can see, since NaN is a floating-point value, the element type of the x1 array is just Float64, while the element type of the x2 array is a Union. In some situations you might want to opt for x1 instead of x2 because operations on it are a bit faster (checking for the possibility of missing, for example, has some minimal overhead). But this is a performance optimization that should normally not be done, because other people reading Julia code will normally think that a NaN is a genuine NaN, not a placeholder that signals missing or nothing.
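
If you do inherit data that uses NaN as a placeholder, one option is to convert it to missing at the boundary of your code so that the intent becomes explicit. A minimal sketch, assuming that every NaN in x1 really stands for an unknown value (x3 is a name made up for this example):

```julia
x1 = [1.0, NaN]        # NaN used as a placeholder
x2 = [1.0, missing]    # the same data with an explicit missing

t1 = eltype(x1)        # Float64
t2 = eltype(x2)        # Union{Missing, Float64}

# Assumption: each NaN in x1 denotes an unknown value, not a genuine NaN.
x3 = [isnan(x) ? missing : x for x in x1]

same = isequal(x3, x2)         # true
total = sum(skipmissing(x3))   # 1.0
```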
