# Neural Network returning NaN as output

With a quick look, and based on the analysis of your multiplier variants, it seems like the `NaN` is produced by an **arithmetic underflow, caused by your gradients being too small** (*too close to absolute zero*).

This is the most suspicious part of the code:

```
f[j + i * 7] = (rows[j][i] == 0 ? .5f : rows[j][i] == 1 ? 0f : 1f);
```

If `rows[j][i] == 1`, then **0f** is stored. I don't know exactly how this is consumed by the neural network (*or even by Java*), but mathematically speaking, the exact zero itself is representable; the danger is everything **near** zero, because a finite-sized **float cannot represent every value close to zero**.

Even if your code altered the `0f` with some extra salt, the resulting array values would still risk drifting too close to zero. Due to the limited precision available for representing real numbers, **values very close to zero lose precision and eventually underflow to zero**, and the operations performed on them afterwards can produce the **NaN**.

These values have a very friendly name: *subnormal numbers*. Any non-zero number with magnitude smaller than the smallest normal number is subnormal.

**IEEE 754**

According to the standard, a stored `0f` is a perfectly valid zero, and subnormals are valid (if imprecise) values, so the `NaN` is not produced by storing them; it is produced by the arithmetic that follows. Expressions such as `0f / 0f`, `0f * Float.POSITIVE_INFINITY`, or `Math.log(0)` (which gives `-Infinity`, and `NaN` one operation later) are the usual suspects, and once a single `NaN` appears, it propagates through every subsequent computation.
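To see these failure modes directly, a minimal standalone check (plain Java, no framework involved) is enough:

```java
public class NaNDemo {
    public static void main(String[] args) {
        float zero = 0f;

        // Operations on zero that yield NaN directly:
        System.out.println(zero / zero);                      // NaN
        System.out.println(zero * Float.POSITIVE_INFINITY);   // NaN

        // Underflow: halving the smallest subnormal flushes to zero.
        System.out.println(Float.MIN_VALUE / 2f);             // 0.0

        // Halving the smallest normal number gives a subnormal: non-zero, but imprecise.
        float sub = Float.MIN_NORMAL / 2f;
        System.out.println(sub > 0f);                         // true

        // NaN is the only float value that is not equal to itself.
        System.out.println((zero / zero) == (zero / zero));   // false
    }
}
```

Once one of these values lands in a weight or activation, every downstream multiplication and addition turns into `NaN` as well, which is why the whole output vector goes `NaN` at once.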

Even though the name is misleading, **Float.MIN_VALUE** is a **positive** value, **higher than 0**. The **real** minimum `float` value is, in fact, **-Float.MAX_VALUE**.
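You can verify these constants directly in plain Java:

```java
public class FloatRangeDemo {
    public static void main(String[] args) {
        // Float.MIN_VALUE is the smallest POSITIVE value, not the lowest float.
        System.out.println(Float.MIN_VALUE);       // 1.4E-45 (positive!)
        // Float.MIN_NORMAL is the smallest positive value with full precision.
        System.out.println(Float.MIN_NORMAL);      // 1.17549435E-38
        // The true minimum float is the negated maximum.
        System.out.println(-Float.MAX_VALUE);      // -3.4028235E38
        System.out.println(Float.MIN_VALUE > 0f);  // true
    }
}
```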

*Is floating point math subnormal?*

*Normalizing the gradients*

If you verify that the issue is caused only by the `0f` values, you could replace them with other values that represent something similar: `Float.MIN_VALUE`, `Float.MIN_NORMAL`, and so on. Do the same in any other part of the code where this scenario could happen. Take these just as examples, and play with these ranges:

```
rows[j][i] == 1 ? Float.MIN_VALUE : 1f;
rows[j][i] == 1 ? Float.MIN_NORMAL : Float.MAX_VALUE/2;
rows[j][i] == 1 ? -Float.MAX_VALUE/2 : Float.MAX_VALUE/2;
```
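Plugged back into the encoding line from your snippet, the first variant would look roughly like this. Note that the 7×6 board shape is my assumption based on the `j + i * 7` indexing; adjust it to your actual dimensions:

```java
public class EncodeDemo {
    public static void main(String[] args) {
        // Hypothetical 7-column, 6-row grid, assumed from the indexing in the question.
        int[][] rows = new int[7][6];
        rows[2][3] = 1;
        float[] f = new float[7 * 6];

        for (int i = 0; i < rows[0].length; i++) {
            for (int j = 0; j < rows.length; j++) {
                f[j + i * 7] = rows[j][i] == 0 ? .5f
                             : rows[j][i] == 1 ? Float.MIN_NORMAL  // instead of 0f
                             : 1f;
            }
        }

        System.out.println(f[2 + 3 * 7]); // 1.17549435E-38 instead of 0.0
    }
}
```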

Even so, this could still lead to a `NaN`, depending on how these values are altered downstream.
If so, you should normalize the values. You could try applying a **GradientNormalizer** for this. In your network initialization, something like this should be defined for each layer (*or for those that are problematic*):

```
new NeuralNetConfiguration
.Builder()
.weightInit(WeightInit.XAVIER)
(...)
.layer(new DenseLayer.Builder().nIn(42).nOut(30).activation(Activation.RELU)
.weightInit(WeightInit.XAVIER)
.gradientNormalization(GradientNormalization.RenormalizeL2PerLayer) //this
.build())
(...)
```

There are different normalizers, so choose which one fits your schema best, and which layers should include one. The options are:

- **RenormalizeL2PerLayer**: rescale gradients by dividing by the L2 norm of all gradients for the layer.

- **RenormalizeL2PerParamType**: rescale gradients by dividing by the L2 norm of the gradients, separately for each type of parameter within the layer. This differs from `RenormalizeL2PerLayer` in that each parameter type (weight, bias, etc.) is normalized separately. For example, in an MLP/feed-forward network (where `G` is the gradient vector):

```
GOut_weight = G_weight / l2(G_weight)
GOut_bias   = G_bias   / l2(G_bias)
```

- **ClipElementWiseAbsoluteValue**: clip the gradients on a per-element basis. For each gradient `g`, set `g <- sign(g) * min(maxAllowedValue, |g|)`; i.e., if a parameter gradient has absolute value greater than the threshold, truncate it. For example, if `threshold = 5`, values in the range `-5 < g < 5` are unmodified, values `< -5` are set to `-5`, and values `> 5` are set to `5`.

- **ClipL2PerLayer**: conditional renormalization. Somewhat similar to `RenormalizeL2PerLayer`, this strategy scales the gradients if and only if the L2 norm of the gradients (for the entire layer) exceeds a specified threshold. Specifically, if `G` is the gradient vector for the layer:

```
GOut = G                            if l2Norm(G) < threshold (i.e., no change)
GOut = threshold * G / l2Norm(G)    otherwise
```

- **ClipL2PerParamType**: conditional renormalization. Very similar to `ClipL2PerLayer`, but instead of clipping per layer, clipping is done on each parameter type separately. For example, in a recurrent neural network, input weight gradients, recurrent weight gradients, and bias gradients are all clipped separately.
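To make the math behind these strategies concrete, here is a hand-rolled sketch of two of them on a plain `float[]`. This is just the formulas above, not DL4J's actual implementation:

```java
public class NormalizerSketch {
    // RenormalizeL2PerLayer: rescale by the L2 norm of the whole gradient array.
    static float[] renormalizeL2(float[] g) {
        double sumSq = 0;
        for (float v : g) sumSq += (double) v * v;
        float norm = (float) Math.sqrt(sumSq);
        float[] out = new float[g.length];
        for (int i = 0; i < g.length; i++) out[i] = g[i] / norm;
        return out;
    }

    // ClipElementWiseAbsoluteValue: clip each element to [-threshold, threshold].
    static float[] clipElementWise(float[] g, float threshold) {
        float[] out = new float[g.length];
        for (int i = 0; i < g.length; i++)
            out[i] = Math.signum(g[i]) * Math.min(threshold, Math.abs(g[i]));
        return out;
    }

    public static void main(String[] args) {
        float[] g = {3f, -4f, 12f};  // L2 norm = 13
        System.out.println(java.util.Arrays.toString(renormalizeL2(g)));
        System.out.println(java.util.Arrays.toString(clipElementWise(g, 5f)));  // [3.0, -4.0, 5.0]
    }
}
```

Either way, the effect is the same: the gradient magnitudes are pulled back into a range where float precision holds, which is exactly what prevents the underflow-to-`NaN` chain described above.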

**Here** you can find a complete example of the application of these `GradientNormalizers`.