TensorFlow NaN bug?
A bias-free alternative.

Many of the other solutions use clipping to avoid an undefined gradient. Depending on your problem, clipping introduces bias and may not be acceptable. As the following code demonstrates, we need only handle the point of discontinuity itself, not the region around it.
Specific Answer
    def cross_entropy(x, y, axis=-1):
        # Inner tf.where: wherever x == 0, feed tf.log a safe value (1.)
        # so no inf is ever created; x * log(1) contributes 0, as desired.
        safe_y = tf.where(tf.equal(x, 0.), tf.ones_like(y), y)
        return -tf.reduce_sum(x * tf.log(safe_y), axis)

    def entropy(x, axis=-1):
        return cross_entropy(x, x, axis)
But did it work?
    x = tf.constant([0.1, 0.2, 0., 0.7])
    e = entropy(x)
    # ==> 0.80181855
    g = tf.gradients(e, x)[0]
    # ==> array([1.30258512, 0.60943794, 0., -0.64332503], dtype=float32)
    # Yay! No NaN.
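(For completeness, a minimal self-contained TF1 script I'd use to reproduce the numbers above; the session boilerplate is my addition, and the later snippets in this thread likewise assume an interactive session so that .eval() works.)

    import tensorflow as tf

    def cross_entropy(x, y, axis=-1):
        safe_y = tf.where(tf.equal(x, 0.), tf.ones_like(y), y)
        return -tf.reduce_sum(x * tf.log(safe_y), axis)

    def entropy(x, axis=-1):
        return cross_entropy(x, x, axis)

    sess = tf.InteractiveSession()
    x = tf.constant([0.1, 0.2, 0., 0.7])
    print(entropy(x).eval())                      # 0.80181855
    print(tf.gradients(entropy(x), x)[0].eval())  # [ 1.30258512 ... ] -- no NaN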
General Recipe
Use an inner tf.where to ensure the function has no asymptote; that is, alter the input to the inf-generating function so that no inf can be created. Then use a second tf.where to always select the valid code path; that is, implement the mathematical condition as you would "normally", i.e., the "naive" implementation.
In Python code, the recipe is:
Instead of this:

    tf.where(x_ok, f(x), safe_f(x))

Do this:

    # Replace bad inputs with any value that f can safely consume (here, 1):
    safe_x = tf.where(x_ok, x, tf.ones_like(x))
    tf.where(x_ok, f(safe_x), safe_f(x))
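As a further illustration (my own example, not from the original answer), the same recipe applied to tf.sqrt, whose gradient 1/(2*sqrt(x)) blows up at x = 0:

    def safe_sqrt(x):
        # Treat x <= 0 as the "bad" input region.
        x_ok = tf.greater(x, 0.)
        # Inner tf.where: hand sqrt a harmless stand-in (1.) at bad entries.
        safe_x = tf.where(x_ok, x, tf.ones_like(x))
        # Outer tf.where: return the intended value, 0, at bad entries.
        return tf.where(x_ok, tf.sqrt(safe_x), tf.zeros_like(x))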
Example
Suppose you wish to compute:
    f(x) = { 1/x,  x != 0
           { 0,    x == 0
A naive implementation results in NaNs in the gradient, i.e.,
    def f(x):
        x_ok = tf.not_equal(x, 0.)
        f = lambda x: 1. / x
        safe_f = tf.zeros_like
        # tf.where picks the right value, but gradients still flow
        # through the 1./x branch -- including at x == 0.
        return tf.where(x_ok, f(x), safe_f(x))
Does it work?
    x = tf.constant([-1., 0, 1])
    tf.gradients(f(x), x)[0].eval()
    # ==> array([ -1., nan, -1.], dtype=float32)
    # ...bah! We have a NaN at the asymptote despite not having
    # an asymptote in the non-differentiated result.

The NaN appears because gradients flow through both branches of tf.where: the untaken 1./x branch contributes d(1/x)/dx = -1/x**2, which is inf at x = 0, and multiplying that by the zero gradient tf.where routes to it gives 0 * inf = NaN.
The basic pattern for avoiding NaN gradients when using tf.where is to call tf.where twice. The innermost tf.where ensures that the result f(x) is always finite. The outermost tf.where ensures the correct result is chosen. For the running example, the trick plays out like this:
    def safe_f(x):
        x_ok = tf.not_equal(x, 0.)
        f = lambda x: 1. / x
        safe_f = tf.zeros_like
        # Inner tf.where: 1./safe_x is finite everywhere, so its gradient is too.
        safe_x = tf.where(x_ok, x, tf.ones_like(x))
        # Outer tf.where: still selects the mathematically correct branch.
        return tf.where(x_ok, f(safe_x), safe_f(x))
But did it work?
    x = tf.constant([-1., 0, 1])
    tf.gradients(safe_f(x), x)[0].eval()
    # ==> array([-1., 0., -1.], dtype=float32)
    # ...yay! The double-where trick worked. Notice that the gradient
    # is now 0 at the former asymptote, rather than NaN.
Actually, it turned out to be something stupid. I'm posting this in case anyone else runs into a similar error.
    cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))

is actually a horrible way of computing the cross-entropy. In some samples, certain classes can be excluded with certainty after a while, resulting in y_conv = 0 for that sample. That is normally not a problem, since you are not interested in those classes, but with cross_entropy written this way it yields 0 * log(0) for that particular sample/class, and since log(0) = -inf, the product evaluates to NaN.
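You can see the failure mode directly (a quick sketch of mine, in the same TF1 .eval() style as the answers above): log(0) evaluates to -inf, and 0 * -inf is NaN under IEEE floating-point rules.

    z = tf.constant(0.)
    tf.log(z).eval()        # ==> -inf
    (z * tf.log(z)).eval()  # ==> nan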
Replacing it with

    cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y_conv, 1e-10, 1.0)))

solved all my problems.
Actually, clipping is not a good idea, as it stops the gradient from propagating backwards once the threshold is reached. Instead, we can add a small constant to the softmax output:

    cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv + 1e-10))
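To make the drawback of clipping concrete, here is a small check (my sketch, using the same TF1 conventions as above): once a value falls below the clipping floor, tf.clip_by_value passes zero gradient back to it, so that entry stops learning entirely.

    y = tf.constant([1e-12, 0.5])  # the first entry sits below the 1e-10 floor
    loss = -tf.reduce_sum(tf.log(tf.clip_by_value(y, 1e-10, 1.0)))
    tf.gradients(loss, y)[0].eval()
    # ==> array([ 0., -2.], dtype=float32) -- no gradient for the clipped entry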