How should we pad text sequence in keras using pad_sequences?
The problem is in this line:
tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")
With split=" " and the nature of your data (no spaces between characters), each sequence ends up consisting of a single word. That's why your padded sequences have only one non-zero element. To change that, try:
txt="a b c d e f g h i j k l m n "*100
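To see why the spaces matter, here's a quick plain-Python illustration (no Keras needed) of what a split on " " finds in each version of the text:

```python
# Without spaces, split(" ") finds no word boundaries: the whole
# string is one "word", so every sequence has length 1.
txt_no_spaces = "abcdefghijklmn" * 100
print(len(txt_no_spaces.split(" ")))   # one token

# With spaces inserted, each character becomes its own token.
txt_spaced = "a b c d e f g h i j k l m n " * 100
print(len(txt_spaced.split()))         # 14 * 100 = 1400 tokens
```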
If you want to tokenize by character, you can do it manually; it's not too complex:
First build a vocabulary for your characters:
txt="abcdefghijklmn"*100
vocab_char = {k: (v+1) for k, v in zip(set(txt), range(len(set(txt))))}
vocab_char['<PAD>'] = 0
This will associate a distinct number for every character in your txt. The character with index 0 should be preserved for the padding.
Having the reverse vocabulary will be useful to decode the output:
rvocab = {v: k for k, v in vocab_char.items()}
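Here's a minimal round trip showing how the two dictionaries work together (I sort the character set so the indices are deterministic; the `set`/`zip` version above works the same way, just with arbitrary index assignment):

```python
# Build the forward vocabulary: characters -> indices starting at 1,
# with 0 reserved for padding.
txt = "abcdefghijklmn"
vocab_char = {c: i + 1 for i, c in enumerate(sorted(set(txt)))}
vocab_char['<PAD>'] = 0

# Reverse vocabulary: indices -> characters, used to decode model output.
rvocab = {v: k for k, v in vocab_char.items()}

encoded = [vocab_char[c] for c in "cab"]       # e.g. [3, 1, 2]
decoded = "".join(rvocab[i] for i in encoded)  # back to "cab"
print(decoded)
```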
Once you have this, you can split your text into sequences; say you want sequences of length seq_len = 13:
[[vocab_char[char] for char in txt[i:(i+seq_len)]] for i in range(0,len(txt),seq_len)]
Your output will look like:
[[9, 12, 6, 10, 8, 7, 2, 1, 5, 13, 11, 4, 3],
[14, 9, 12, 6, 10, 8, 7, 2, 1, 5, 13, 11, 4],
...,
[2, 1, 5, 13, 11, 4, 3, 14, 9, 12, 6, 10, 8],
[7, 2, 1, 5, 13, 11, 4, 3, 14]]
Note that the last sequence doesn't have the same length; you can discard it, or pad your sequences to max_len = 13, which will add 0's to it.
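The padding step is simple enough to sketch in plain Python; this mimics post-padding (zeros appended on the right), similar to what Keras's pad_sequences does with padding='post':

```python
# Right-pad the ragged trailing chunk with 0's (the <PAD> index)
# so every sequence has length seq_len.
seq_len = 13
sequences = [[7, 2, 1, 5, 13, 11, 4, 3, 14]]  # the short last sequence
padded = [s + [0] * (seq_len - len(s)) for s in sequences]
print(padded)  # [[7, 2, 1, 5, 13, 11, 4, 3, 14, 0, 0, 0, 0]]
```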
You can build your targets Y the same way, by shifting everything by 1. :-)
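Concretely, "shifting by 1" for next-character prediction looks like this (a sketch on a toy string; indices come from the sorted-vocabulary variant above):

```python
# Next-character targets: Y is X shifted left by one position,
# so at each step the model predicts the following character.
txt = "abcdef"
vocab_char = {c: i + 1 for i, c in enumerate(sorted(set(txt)))}
encoded = [vocab_char[c] for c in txt]  # [1, 2, 3, 4, 5, 6]

X = encoded[:-1]  # inputs:  [1, 2, 3, 4, 5]
Y = encoded[1:]   # targets: [2, 3, 4, 5, 6]
```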
I hope this helps.