tflearn / tensorflow does not learn xor

In addition to @Ishamael's advice, consider using a different loss function. Mean squared error is generally not a good choice for sigmoid activations, because once the sigmoid saturates its gradient shrinks until it is too small to be useful for learning.
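
For example, here's a minimal sketch of what swapping the loss might look like (not the asker's exact code, and the hyperparameters are placeholders). As far as I can tell, tflearn's built-in 'binary_crossentropy' objective wraps sigmoid_cross_entropy_with_logits, so the output layer is left linear and the loss applies the sigmoid itself:

import tensorflow as tf
import tflearn

X = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]
Y = [[0.], [1.], [1.], [0.]]

with tf.Graph().as_default():
    net = tflearn.input_data(shape=[None, 2])
    net = tflearn.fully_connected(net, 2, activation='sigmoid')
    # Linear output = raw logits; 'binary_crossentropy' applies the sigmoid.
    net = tflearn.fully_connected(net, 1, activation='linear')
    net = tflearn.regression(net, optimizer='sgd', learning_rate=1.,
                             loss='binary_crossentropy')

    m = tflearn.DNN(net)
    m.fit(X, Y, n_epoch=5000, snapshot_epoch=False)
    # Predictions are logits here: values above 0 correspond to "1".
    print(m.predict(X))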


I've decided to add another answer: I've done some more research and have some substantially different advice to provide.

After skimming this paper, it dawned on me that the reason you're not seeing convergence might have to do with the initial weights. The paper specifically references some work by Hirose et al (Hirose, Yamashita, and Hijiya 1991) which found that initialization with a limited range of weights results in a very low probability of convergence. The "sweet spot" for reliable convergence seemed to be an average weight range between 0.5 and 1.

It turns out that tflearn defaults to truncated normal initialization with a stddev of 0.02, so the weights start out in a very limited range. I've found that I can get reasonably reliable results using random uniform initialization in the range -1.0 to 1.0.
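
For reference, this is my reading of how the two initializers compare in tflearn; the full working code below passes the wider one in via weights_init:

import tflearn

# tflearn's default weight init: a truncated normal with stddev 0.02.
default_init = tflearn.initializations.truncated_normal(stddev=0.02)
# The much wider uniform init that gave me reliable convergence.
wide_init = tflearn.initializations.uniform(minval=-1.0, maxval=1.0)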

Also, incidentally, it turns out that you've added a third layer. XOR requires only one hidden layer, so you can remove the second one. Here's the code that works for me:

import tensorflow as tf
import tflearn

X = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]
Y_xor = [[0.], [1.], [1.], [0.]]

# Graph definition
with tf.Graph().as_default():
    tnorm = tflearn.initializations.uniform(minval=-1.0, maxval=1.0)
    net = tflearn.input_data(shape=[None, 2])
    net = tflearn.fully_connected(net, 2, activation='sigmoid', weights_init=tnorm)
    net = tflearn.fully_connected(net, 1, activation='sigmoid', weights_init=tnorm)
    regressor = tflearn.regression(net, optimizer='sgd', learning_rate=2., loss='mean_square')

    # Training
    m = tflearn.DNN(regressor)
    m.fit(X, Y_xor, n_epoch=10000, snapshot_epoch=False) 

    # Testing
    print("Testing XOR operator")
    print("0 xor 0:", m.predict([[0., 0.]]))
    print("0 xor 1:", m.predict([[0., 1.]]))
    print("1 xor 0:", m.predict([[1., 0.]]))
    print("1 xor 1:", m.predict([[1., 1.]]))

Note that I am using mean square error. To my surprise, it seems to work best for this problem. Cross-entropy seems to cause the optimizer to languish in relatively flat regions of the problem space. I would have expected the opposite; maybe someone better versed in the mathematics will be able to better explain that.


The network with relus (as it is written in the code snippet) is expected to fail to train fairly often. The reason is that when the input to a relu is less than zero, its output is zero, and therefore the gradient flowing back through it is also zero.
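
As a quick illustration of that point, here is a toy single-unit example (not your actual network) where the pre-activation is negative for every XOR input, so both the output and the gradient are zero:

import tensorflow as tf

with tf.Graph().as_default():
    x = tf.constant([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    w = tf.Variable([[-1.0], [-1.0]])       # both weights negative
    pre = tf.matmul(x, w)                   # <= 0 for all four inputs
    out = tf.nn.relu(pre)                   # all zeros
    grad = tf.gradients(tf.reduce_sum(out), w)[0]
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(out).ravel())        # [ 0.  0.  0.  0.]
        print(sess.run(grad).ravel())       # [ 0.  0.] -- no learning signal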

Since you have two layers, each with only two relu units, under random initialization each of these layers has a 25% chance of having all its neurons return zero, and therefore zero gradient flowing back => the network will not learn at all. In such a network the output of the last layer (before the final sigmoid) will be zero, whose sigmoid is 0.5 -- exactly what you observe on the runs where your network didn't converge.

Since each layer has a 25% chance of doing this damage, the entire network has a total chance of around 44% (1 - (1 - 0.25)^2) of failing to train from the get-go. There's also a non-zero chance that the network does not start in such a state, but drives itself into one during training, further increasing the chance of divergence.
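
Spelling out the arithmetic behind that estimate:

# Each layer is assumed to start "dead" with probability 0.25, independently.
p_layer_dead = 0.25
p_fail_at_init = 1 - (1 - p_layer_dead) ** 2
print(p_fail_at_init)   # 0.4375, i.e. roughly 44%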

With four neurons per layer the chance will be significantly lower, but still not zero.

Now, the only thing I cannot answer is why your network doesn't converge when you replace relu with sigmoid -- such a network should always be able to learn "xor". My only hypothesis is that you replaced only one relu with sigmoid, not both of them.

Can you replace both relus with sigmoids and confirm you still observe divergence?
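
For reference, here's a sketch of what I mean. It assumes the 2-2-2-1 layer sizes described above and fills in placeholder training settings, so it's an illustration rather than your exact code:

import tensorflow as tf
import tflearn

X = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]
Y_xor = [[0.], [1.], [1.], [0.]]

with tf.Graph().as_default():
    net = tflearn.input_data(shape=[None, 2])
    net = tflearn.fully_connected(net, 2, activation='sigmoid')  # was relu
    net = tflearn.fully_connected(net, 2, activation='sigmoid')  # was relu
    net = tflearn.fully_connected(net, 1, activation='sigmoid')
    net = tflearn.regression(net, optimizer='sgd', learning_rate=2.,
                             loss='mean_square')

    m = tflearn.DNN(net)
    m.fit(X, Y_xor, n_epoch=10000, snapshot_epoch=False)
    print(m.predict(X))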