Keras: Making a neural network to find a number's modulus
UPD
After some tinkering I was able to get to a reasonably good solution using RNNs. It trains on less than 5% of all possible unique inputs and gives >90% accuracy on the random test sample. You can increase number of batches to 100 from 40 to make it a bit more accurate (though in some runs there is a chance the model won't converge to the right answer - here it is higher than usually). I have switched to using Adam optimizer here and had to increase number of samples to 50K (10K led to overfitting for me).
Please understand that this solution is a bit of a tongue-in-cheek thing, because it is based on the task-domain knowledge that our target function can be defined by a simple recurring formula on the sequence of input bits (even simpler formula if you reverse your input bit sequence, but using go_backwards=True
in LSTM didn't help here).
If you inverse the input bits order (so that we always start with the most significant bit) than the recurring formula for the target function is just F_n = G(F_{n-1}, x_n)
, where F_n = MOD([x_1,...,x_n], 7)
, and G(x, y) = MOD(2*x+y, 7)
- only has 49 different inputs and 7 possible outputs. So the model kind of have to learn initial state + this G
update function. For the sequence starting with the least significant bit the recurring formula is slightly more complicated cause it will also need to keep track on what is current MOD(2**n, 7)
on each step, but it seems that this difficulty doesn't matter for training.
Please note - these formulas are only to explain why RNN works here. The net below is just a plain LSTM layer + softmax with original input of bits treated as a sequence.
Full code for the answer using RNN layer:
import keras.models
import numpy as np
from python_toolbox import random_tools
RADIX = 7
FEATURE_BITS = 20
def _get_number(vector):
return sum(x * 2 ** i for i, x in enumerate(vector))
def _get_mod_result(vector):
return _get_number(vector) % RADIX
def _number_to_vector(number):
binary_string = bin(number)[2:]
if len(binary_string) > FEATURE_BITS:
raise NotImplementedError
bits = (((0,) * (FEATURE_BITS - len(binary_string))) +
tuple(map(int, binary_string)))[::-1]
assert len(bits) == FEATURE_BITS
return np.c_[bits]
def get_mod_result_vector(vector):
v = np.repeat(0, 7)
v[_get_mod_result(vector)] = 1
return v
def main():
model = keras.models.Sequential(
(
keras.layers.Reshape(
(1, -1)
),
keras.layers.LSTM(
units=100,
),
keras.layers.Dense(
units=7, activation='softmax'
)
)
)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
loss='categorical_crossentropy',
metrics=['accuracy'])
data = np.random.randint(2, size=(50000, FEATURE_BITS))
labels = np.vstack(map(get_mod_result_vector, data))
model.fit(data, labels, epochs=40, batch_size=50)
def predict(number):
foo = model.predict(_number_to_vector(number))
return np.argmax(foo)
def is_correct_for_number(x):
return bool(predict(x) == x % RADIX)
sample = random_tools.shuffled(range(2 ** FEATURE_BITS))[:500]
print('Total accuracy:')
print(sum(map(is_correct_for_number, sample)) / len(sample))
print(f'(Accuracy of random algorithm is {1/RADIX:.2f}')
if __name__ == '__main__':
main()
ORIGINAL ANSWER
I'm not sure how it happened, but the particular task you chose to check your code is extremely difficult for a NN. I think the best explanation would be that NNs are not really good when features are interconnected in such way that changing one feature always change value of your target output completely. One way to look at it would be to see the sets of features when you expect a certain answer - in your case they will look like unions of very large number of parallel hyper planes in 20 dimensional space - and for each of 7 categories these sets of planes are "nicely" interleaved and left for NN to distinguish.
That said - if your number of examples is large, say 10K and number of possible inputs is smaller, say your input bit numbers are only 8 bits large (so 256 unique inputs possible only) - networks should "learn" the right function quite ok (by "remembering" correct answers for every input, without generalization). In your case that doesn't happen because the code has the following bug.
Your labels were 20-dimensional vectors with bits of 0-6 integer (your actual desired label) - so I guess you were pretty much trying to teach NN to learn bits of the answer as separate classifiers (with only 3 bits ever possible to be non-zero). I changed that to what I assume you actually wanted - vectors of length 7 with only one value being 1 and others 0 (so-called one hot encoding which keras actually expects for categorical_crossentropy
according to this). If you wanted to try to learn each bit separately you definitely shouldn't have used softmax 20 in the last layer, cause such output generates probabilities on 20 classes which sum up to 1 (in that case you should have trained 20-or-rather-3 binary classifiers instead). Since your code didn't give keras correct input the model you got in the end was kind of random and with rounding you applied was intented to output the same value for 95%-100% of inputs.
Slightly changed code below trains a model which can more or less correctly guess the mod 7 answer for every number 0 to 255 (again, pretty much remembers the correct answer for every input). If you try to increase FEATURE_BITS
you will see large degradation of the results. If you actually want to train NN to learn this task as is with 20 or more bits of input (and without supplying NN with all possible inputs and infinite time to train) you will need to apply some task-specific feature transformations and/or some layers carefully designed to exactly be good at task you want to achieve as others already mentioned in comments to your question.
import keras.models
import numpy as np
from python_toolbox import random_tools
RADIX = 7
FEATURE_BITS = 8
def _get_number(vector):
return sum(x * 2 ** i for i, x in enumerate(vector))
def _get_mod_result(vector):
return _get_number(vector) % RADIX
def _number_to_vector(number):
binary_string = bin(number)[2:]
if len(binary_string) > FEATURE_BITS:
raise NotImplementedError
bits = (((0,) * (FEATURE_BITS - len(binary_string))) +
tuple(map(int, binary_string)))[::-1]
assert len(bits) == FEATURE_BITS
return np.c_[bits]
def get_mod_result_vector(vector):
v = np.repeat(0, 7)
v[_get_mod_result(vector)] = 1
return v
def main():
model = keras.models.Sequential(
(
keras.layers.Dense(
units=20, activation='relu', input_dim=FEATURE_BITS
),
keras.layers.Dense(
units=20, activation='relu'
),
keras.layers.Dense(
units=7, activation='softmax'
)
)
)
model.compile(optimizer='sgd',
loss='categorical_crossentropy',
metrics=['accuracy'])
data = np.random.randint(2, size=(10000, FEATURE_BITS))
labels = np.vstack(map(get_mod_result_vector, data))
model.fit(data, labels, epochs=100, batch_size=50)
def predict(number):
foo = model.predict(_number_to_vector(number))
return np.argmax(foo)
def is_correct_for_number(x):
return bool(predict(x) == x % RADIX)
sample = random_tools.shuffled(range(2 ** FEATURE_BITS))[:500]
print('Total accuracy:')
print(sum(map(is_correct_for_number, sample)) / len(sample))
print(f'(Accuracy of random algorithm is {1/RADIX:.2f}')
if __name__ == '__main__':
main()
This achieves an accuracy of 99.74% and a validation accuracy of 99.69%.
import tensorflow as tf, numpy as np
def int2bits(i,fill=20):
return list(map(int,bin(i)[2:].zfill(fill)))
def bits2int(b):
return sum(i*2**n for n,i in enumerate(reversed(b)))
# Data.
I = np.random.randint(0,2**20,size=(250_000,))
X = np.array(list(map(int2bits,I)))
Y = np.array([int2bits(2**i,7) for i in I % 7])
# Test Data.
It = np.random.randint(0,2**20,size=(10_000,))
Xt = np.array(list(map(int2bits,It)))
Yt = np.array([int2bits(2**i,7) for i in It % 7])
# Model.
model = tf.keras.models.Sequential([
tf.keras.layers.Dense(1000,'relu'),
tf.keras.layers.Dense(7,'softmax'),
])
model.compile('adam','categorical_crossentropy',['accuracy'])
# Train.
model.fit(X,Y,10_000,100,validation_data=(Xt,Yt))
Some take-aways:
1) You had way too little data. You were uniformly sampling points from 0 to 2**20, but only sampled 10,000, which is only about 1% of the possible vectors that the model is suppose to learn about. The point is that a lot of components (in the binary representation) would be mostly fixed at zero or one without any opportunity to learn how they function in the overall data or how they interact with other components.
2) You needed an embedding layer, namely extend the space into some massive higher dimension, so the neurons can move around more easily. This allows the learning to shuffle things better hopefully finding the algorithm your looking for. A single Dense(1000) seems to work.
3) Ran batches of 10_000 (just so I maximize my CPU usage). Ran 100 epochs. Included my validation_data in the training so I could see how the validation set performs at each epoch (including this doesn't effect the training, just makes it easier to see if the model is doing well, while training).
Thanks. :-)