Neural Network Always Produces Same/Similar Outputs for Any Input
So I realise this is extremely late for the original post, but I came across this because I was having a similar problem and none of the reasons posted here cover what was wrong in my case.
I was working on a simple regression problem, but every time I trained the network it would converge to a point where it was giving me the same output (or sometimes a few different outputs) for each input. I played with the learning rate, the number of hidden layers/nodes, the optimization algorithm etc but it made no difference. Even when I looked at a ridiculously simple example, trying to predict the output (1d) of two different inputs (1d):
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
class net(nn.Module):
def __init__(self, obs_size, hidden_size):
super(net, self).__init__()
self.fc = nn.Linear(obs_size, hidden_size)
self.out = nn.Linear(hidden_size, 1)
def forward(self, obs):
h = F.relu(self.fc(obs))
return self.out(h)
inputs = np.array([[0.5],[0.9]])
targets = torch.tensor([3.0, 2.0], dtype=torch.float32)
network = net(1,5)
optimizer = torch.optim.Adam(network.parameters(), lr=0.001)
for i in range(10000):
out = network(torch.tensor(inputs, dtype=torch.float32))
loss = F.mse_loss(out, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("Loss: %f outputs: %f, %f"%(loss.data.numpy(), out.data.numpy()[0], out.data.numpy()[1]))
but STILL it was always outputting the average value of the outputs for both inputs. It turns out the reason is that the dimensions of my outputs and targets were not the same: the targets were Size[2], and the outputs were Size[2,1], and for some reason PyTorch was broadcasting the outputs to be Size[2,2] in the MSE loss, which completely messes everything up. Once I changed:
targets = torch.tensor([3.0, 2.0], dtype=torch.float32)
to
targets = torch.tensor([[3.0], [2.0]], dtype=torch.float32)
It worked as it should. This was obviously done with PyTorch, but I suspect maybe other libraries broadcast variables in the same way.
For me it was happening exactly like in your case, the output of neural network was always the same no matter the training & number of layers etc.
Turns out my back-propagation algorithm had a problem. At one place I was multiplying by -1 where it wasn't required.
There could be another problem like this. The question is how to debug it?
Steps to debug:
Step1 : Write the algorithm such that it can take variable number of input layers and variable number of input & output nodes.
Step2 : Reduce the hidden layers to 0. Reduce input to 2 nodes, output to 1 node.
Step3 : Now train for binary-OR-Operation.
Step4 : If it converges correctly, go to Step 8.
Step5 : If it doesn't converge, train it only for 1 training sample
Step6 : Print all the forward and prognostication variables (weights, node-outputs, deltas etc)
Step7 : Take pen&paper and calculate all the variables manually.
Step8 : Cross verify the values with algorithm.
Step9 : If you don't find any problem with 0 hidden layers. Increase hidden layer size to 1. Repeat step 5,6,7,8
It sounds like a lot of work, but it works very well IMHO.
I've had similar problems, but was able to solve by changing these:
- Scale down the problem to manageable size. I first tried too many inputs, with too many hidden layer units. Once I scaled down the problem, I could see if the solution to the smaller problem was working. This also works because when it's scaled down, the times to compute the weights drop down significantly, so I can try many different things without waiting.
- Make sure you have enough hidden units. This was a major problem for me. I had about 900 inputs connecting to ~10 units in the hidden layer. This was way too small to quickly converge. But also became very slow if I added additional units. Scaling down the number of inputs helped a lot.
- Change the activation function and its parameters. I was using tanh at first. I tried other functions: sigmoid, normalized sigmoid, Gaussian, etc.. I also found that changing the function parameters to make the functions steeper or shallower affected how quickly the network converged.
- Change learning algorithm parameters. Try different learning rates (0.01 to 0.9). Also try different momentum parameters, if your algo supports it (0.1 to 0.9).
Hope this helps those who find this thread on Google!