Should RNN attention weights over variable length sequences be re-normalized to "mask" the effects of zero-padding?
Great question! I believe your concern is valid and zero attention scores for the padded encoder outputs do affect the attention. However, there are few aspects that you have to keep in mind:
There are different score functions, the one in tf-rnn-attention uses simple linear + tanh + linear transformation. But even this score function can learn to output negative scores. If you look at the code and imagine
inputs
consists of zeros, vectorv
is not necessarily zero due to bias and the dot product withu_omega
can boost it further to low negative numbers (in other words, plain simple NN with a non-linearity can make both positive and negative predictions). Low negative scores don't water down the high scores in softmax.Due to bucketing technique, the sequences within a bucket usually have roughly the same length, so it's unlikely to have half of the input sequence padded with zeros. Of course, it doesn't fix anything, it just means that in real applications negative effect from the padding is naturally limited.
You mentioned it in the end, but I'd like to stress it too: the final attended output is the weighted sum of encoder outputs, i.e. relative values actually matter. Take your own example and compute the weighted sum in this case:
- the first one is
0.2 * o1 + 0.23 * o2
(the rest is zero) - the second one is
0.48 * o1 + 0.52 * o2
(the rest is zero too)
Yes, the magnitude of the second vector is two times bigger and it isn't a critical issue, because it goes then to the linear layer. But relative attention ono2
is just 7% higher, than it would have been with masking.What this means is that even if the attention weights won't do a good job in learning to ignore zero outputs, the end effect on the output vector is still good enough for the decoder to take the right outputs into account, in this case to concentrate on
o2
.- the first one is
Hope this convinces you that re-normalization isn't that critical, though probably will speed-up learning if actually applied.
BERT implementation applies a padding mask for calculating attention score. Adds 0 to the non-padding attention score and adds -10000 to padding attention scores. the e^-10000 is very small w.r.t to other attention score values.
attention_score = [0.1, 0.2, 0, 0, 0]
mask = [0, 0, -10000, -10000] # -10000 is a large negative value
attention_score += mask
weights = softmax(attention_score)