Usage and convention differences between missing, nothing, undef, and NaN in Julia
TLDR:
- If you're working in statistics, chances are that you want `missing` to signal the absence of a particular data point in a collection.
- If you want to define an array of floating-point numbers but initialize individual elements later, you might want to use `undef` for performance reasons (to avoid spending time setting elements to a value that will be overwritten afterwards): `Vector{Float64}(undef, n)`
- In the same situation, but following an approach oriented less towards performance and more towards safety, you can also initialize all elements to `NaN` to take advantage of the propagating behavior of `NaN`, which helps identify bugs that could happen if you forget to set some value in the array: `fill(NaN, n)`
- You'll probably encounter `nothing` in some parts of Julia's API, where it signals cases in which no meaningful value can be computed. But it is generally not used in arrays otherwise containing numeric data (which seems to be your use case here).
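To make the second and third points concrete, here is a minimal sketch of both initialization styles. The size `n = 5`, the `i^2` values and the deliberately incomplete second loop are assumptions chosen purely for illustration.

```julia
n = 5

# Performance-oriented: allocate without initializing, then set every slot.
a = Vector{Float64}(undef, n)
for i in eachindex(a)
    a[i] = i^2              # every element is overwritten before being read
end

# Safety-oriented: start from NaN so that a forgotten element is easy to spot.
b = fill(NaN, n)
for i in 1:n-1              # deliberately "forgets" the last element
    b[i] = i^2
end

forgotten = findall(isnan, b)   # indices that were never assigned
total = sum(b)                  # NaN, because the forgotten element propagates
```

With `undef`, the same forgotten element would silently hold an arbitrary number instead of `NaN`.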
Here is my take on the differences between these options:
`missing` is used to represent missing values in a statistical sense, i.e. values that theoretically exist but that you don't know. `missing` is similar in spirit (and in behavior, in most cases) to `NA` in R. A defining feature of `missing` values is that you can use them in computations:
julia> x = 1 # x has a known value: 1
1
julia> y = missing # y has a value, but it is unknown
missing
julia> z = x * y # no error: z has a value, that just happens to be unknown
missing # (as a consequence of not knowing the value of y)
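As a small sketch of how this propagation plays out on a collection, and of the Base functions commonly used to deal with it (the vector `v` and the default value `0` are assumptions made for the example):

```julia
v = [1, missing, 3]

total_all   = sum(v)                 # missing: the unknown value propagates
total_known = sum(skipmissing(v))    # 4: missing entries are skipped
filled      = coalesce.(v, 0)        # [1, 0, 3]: missing replaced by a default

# Comparisons propagate too, so test with ismissing rather than ==.
cmp = (missing == missing)           # missing, not true
```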
One important characteristic of `missing` is that it has its own specific type: `Missing`. This means in particular that arrays containing `missing` values among other numeric values are not homogeneous in type:
julia> [1, missing, 3]
3-element Array{Union{Missing, Int64},1}: # not Array{Int64, 1}
1
missing
3
Note that, although the Julia compiler has become very good at handling arrays whose element type is such a small union, there is an inherent performance cost to having elements of different types, since the type of each element cannot be known in advance.
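If it helps, the element type can be inspected directly. The sketch below assumes a Julia version that provides `nonmissingtype` in Base, and uses `Int` rather than `Int64` so that it does not depend on the machine word size.

```julia
v = [1, missing, 3]

T = eltype(v)                   # Union{Missing, Int}
S = nonmissingtype(T)           # Int

# Once the missing entries are dropped, the result is a plain Vector{Int}.
w = collect(skipmissing(v))
```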
`nothing` also has its own type: `Nothing`. In contrast to `missing`, it tends to be used for things that have no value. This is why, in contrast to `missing`, computing with `nothing` does not make sense and errors out:
julia> 3*nothing
ERROR: MethodError: no method matching *(::Int64, ::Nothing)
`nothing` is primarily used as the return value of functions that don't return anything, either because they only have side effects or because they could not compute any meaningful result:
julia> @show println("OK") # Only side effects
OK
println("OK") = nothing
julia> @show findfirst('a', "Hello") # No meaningful result
findfirst('a', "Hello") = nothing
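On the calling side, such a result is usually checked before being used. Here is a minimal sketch; `position_or_zero` is a hypothetical helper invented for this example and is not part of Base, and returning `0` as the fallback is an arbitrary choice.

```julia
# Hypothetical helper: position of a character in a string, or 0 if absent.
function position_or_zero(c::Char, s::AbstractString)
    idx = findfirst(c, s)              # an index, or nothing
    return idx === nothing ? 0 : idx
end

found     = position_or_zero('l', "Hello")   # 3
not_found = position_or_zero('a', "Hello")   # 0

# something returns its first argument that is not nothing.
fallback = something(findfirst('a', "Hello"), 0)   # 0
```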
Another notable use of `nothing` is in function arguments or object fields for which a value is not always provided. This would typically be represented in the type system as a `Union{MeaningfulType, Nothing}`. For example, with the following definition of a binary tree structure, a leaf (which, by definition, is a node that has no children) would be represented as a node whose children are `nothing`:
struct TreeNode
    child1 :: Union{TreeNode, Nothing}
    child2 :: Union{TreeNode, Nothing}
end
leaf = TreeNode(nothing, nothing)
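One way such a structure might be traversed is sketched below. The struct is repeated so that the snippet runs on its own, and `count_leaves` is a hypothetical function written for this example; dispatching on `Nothing` is just one possible way to handle the "no child" case.

```julia
struct TreeNode
    child1 :: Union{TreeNode, Nothing}
    child2 :: Union{TreeNode, Nothing}
end

# Hypothetical helper: an absent child contributes no leaves.
count_leaves(::Nothing) = 0

function count_leaves(node::TreeNode)
    if node.child1 === nothing && node.child2 === nothing
        return 1                        # a node without children is a leaf
    end
    return count_leaves(node.child1) + count_leaves(node.child2)
end

leaf = TreeNode(nothing, nothing)
tree = TreeNode(TreeNode(leaf, leaf), leaf)
n_leaves = count_leaves(tree)           # 3
```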
Unlike the previous two, `NaN` does not have its own specific type: `NaN` is merely a specific value of the `Float64` type (and `NaN32` similarly exists for `Float32`). As you probably know, these values normally appear as the result of undefined operations (such as `0/0`), and have a very special meaning in floating-point arithmetic, which makes them propagate (in more or less the same way as `missing` values). But apart from that arithmetic behavior, these are normal floating-point values. In particular, a vector of floating-point values may contain `NaN`s without it affecting its type:
julia> [1., NaN, 2.]
3-element Array{Float64,1}: # Note how this differs from the example with missing above
1.0
NaN
2.0
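The "more or less" above matters in practice: arithmetic propagates `NaN`, but comparisons return an ordinary `Bool` instead of propagating. A short sketch, with variable names chosen only for this example:

```julia
t64 = typeof(NaN)        # Float64
t32 = typeof(NaN32)      # Float32

x = 0.0 / 0.0            # NaN
y = x + 1.0              # NaN: propagates through arithmetic

# Unlike missing, comparing with NaN gives a plain Bool.
self_equal = (x == x)    # false
is_nan     = isnan(x)    # true: the reliable way to test for NaN
same_value = isequal(x, NaN)   # true
```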
`undef` is very different from everything that has been mentioned so far. It is not really a value (at least not in the sense of a number having a value), but rather a "flag" that one can pass to array constructors to tell Julia not to initialize the values in the array (generally for performance considerations). In the following example, the array elements will not be set to any specific value but, since there is no such thing as a number without any value in Julia, elements will have arbitrary values (coming from whatever happens to be in memory where the vector gets allocated).
julia> Vector{Float64}(undef, 3)
3-element Array{Float64,1}:
6.94567437726575e-310
6.94569509953624e-310
6.94567437549977e-310
When elements are of a more complex type (in technical terms: a non-isbits type) and a distinction can be made between initialized and uninitialized elements, Julia denotes the latter with `#undef`:
julia> mutable struct Foo end
julia> Vector{Foo}(undef, 3)
3-element Array{Foo,1}:
#undef
#undef
#undef
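Whether a given slot has been initialized can be queried with `isassigned`. The following sketch reuses the `Foo` type from above; the `try`/`catch` is there only to show which error an uninitialized read raises.

```julia
mutable struct Foo end

v = Vector{Foo}(undef, 3)
before = isassigned(v, 1)       # false: slot 1 is still #undef
v[1] = Foo()
after  = isassigned(v, 1)       # true
other  = isassigned(v, 2)       # false

# Reading an unassigned slot throws instead of returning garbage.
err = try
    v[2]
catch e
    e
end
is_undef_error = err isa UndefRefError   # true

# For bits types, every slot always holds some (arbitrary) value.
w = Vector{Float64}(undef, 3)
bits_assigned = isassigned(w, 1)         # true
```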
I would summarize the options as follows. What I write is from the perspective of "reading values", but this is also guidance when "writing values".
- `nothing` means "the value does not exist" (e.g. `findfirst` returns `nothing` when it does not find a value in a collection); it is of a separate type, `Nothing`.
- `missing` means "the value itself exists but we do not know it" (I would expect that you normally get `missing` in your data only if it is taken from an external source, e.g. you have a record of patient data and the body temperature is missing: obviously it exists, it simply was not recorded; I do not think that any function from Base returns it unless it got `missing` as an argument); it is of a separate type, `Missing`.
- `NaN` is just a numeric value (as opposed to `nothing` and `missing`); it signals to the user that some operation on numeric values returned `NaN`; in my experience this is the only case when `NaN` should appear in your data (e.g. as the result of `0/0`).
- `undef` is not a value you will see; it is only used in the form e.g. `Vector{Int}(undef, 10)` to create an array without initializing its values (so this is only a performance optimization); you should use it only if you immediately want to initialize the elements of the array with some values you plan to compute (using `undef` will lead to `#undef` entries if the element type of the array is not a bits type or a union of bits types; for bits types, using `undef` to initialize an array just gives you some garbage).
Those are the standard rules. Now, an exception (and this is typical practice in some other languages) is that sometimes you might want to use `NaN` to signal a `missing` or `nothing` in a collection. It is not recommended, but it has one benefit, which you can see in this example:
julia> x1 = [1.0, NaN]
2-element Array{Float64,1}:
1.0
NaN
julia> x2 = [1.0, missing]
2-element Array{Union{Missing, Float64},1}:
1.0
missing
As you can see, because `NaN` is a floating-point value, the element type of the `x1` array is just `Float64`, while the element type of the `x2` array is a `Union`. In some situations you might want to opt for `x1` instead of `x2` because it is a bit faster to operate on (checking e.g. for the possibility of `missing` has some minimal overhead). But this is a performance optimization that should normally not be done, as other people reading Julia code will normally think that `NaN` is a genuine `NaN`, not a placeholder that signals `missing` or `nothing`.
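The ambiguity can be illustrated with a short sketch. The three-element vectors here are assumptions for the example (slightly longer than `x1` and `x2` above), and `x3` stands in for data where the `NaN` came out of a real computation.

```julia
x1 = [1.0, NaN, 3.0]          # NaN used as a placeholder (discouraged)
x2 = [1.0, missing, 3.0]      # missing used as intended

s1 = sum(filter(!isnan, x1))  # 4.0: placeholders filtered out by hand
n_missing = count(ismissing, x2)   # 1: the gaps are explicit in the data

# A NaN produced by a computation cannot be told apart from a placeholder.
x3 = [1.0, 0.0 / 0.0, 3.0]
indistinguishable = isequal(x1, x3)   # true
```

Filtering with `!isnan` would therefore also discard any genuine `NaN` results, which is one reason the `missing` representation is usually preferred.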