Loss function works with reduce_mean but not reduce_sum

if you use reduce_sum instead of reduce_mean, then the gradient is much larger. Therefore, you should correspondingly narrow down the learning rate to make sure the training process can properly carry on.

The loss by summing across the batch is 1000 times larger (from skimming the code I think your training batch size is 1000) so your gradients and parameter updates are also 1000 times larger. The larger updates apparently lead to nans.

Generally learning rates are expressed per example so the loss to find the gradients for updates should be per example also. If the loss is per batch then the learning rate needs to be reduced by the batch size to get comparable training results.