What is the difference between Luong attention and Bahdanau attention?
I went through the paper Effective Approaches to Attention-based Neural Machine Translation. In section 3.1 they describe the difference between the two attention mechanisms as follows:
Luong attention uses the top hidden layer states of both the encoder and the decoder, whereas Bahdanau attention takes the concatenation of the forward and backward source hidden states (top hidden layer).
In Luong attention, they take the decoder hidden state at time t, calculate the attention scores, and from those obtain the context vector, which is concatenated with the decoder hidden state before making the prediction.
In Bahdanau attention, at time t we instead use the decoder hidden state at t-1. The alignment scores and the context vector are calculated as above, but then the context is concatenated with the decoder hidden state at t-1, and this concatenated vector goes through a GRU before the softmax.
Luong attention has different types of alignment score functions, whereas Bahdanau attention has only the concat score alignment model.
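As I understand it, one decoding step under each scheme looks roughly like the toy PyTorch sketch below. This is my own illustration with made-up sizes, not code from either paper, and I use a GRU cell in both cases just to keep it short (Luong actually uses stacked LSTMs):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H, src_len = 8, 5                          # toy hidden size and source length
enc = torch.randn(src_len, H)              # encoder top-layer states h_1..h_S
s_prev = torch.randn(H)                    # decoder hidden state at t-1
x_t = torch.randn(H)                       # embedded decoder input at time t

# Bahdanau-style step: score with s_{t-1}, feed the context INTO the RNN cell
W = nn.Linear(H, H, bias=False)
U = nn.Linear(H, H, bias=False)
v = nn.Linear(H, 1, bias=False)
score = v(torch.tanh(W(s_prev) + U(enc))).squeeze(-1)    # additive score, shape [src_len]
alpha = torch.softmax(score, dim=0)                      # alignment weights
context = alpha @ enc                                    # context vector, shape [H]
gru = nn.GRUCell(2 * H, H)
s_t = gru(torch.cat([x_t, context]).unsqueeze(0),        # context enters the GRU
          s_prev.unsqueeze(0)).squeeze(0)                # s_t then feeds the softmax layer

# Luong-style step: run the RNN first, score with h_t, combine the context AFTERWARDS
rnn = nn.GRUCell(H, H)
h_t = rnn(x_t.unsqueeze(0), s_prev.unsqueeze(0)).squeeze(0)
score = enc @ h_t                                        # dot score, shape [src_len]
alpha = torch.softmax(score, dim=0)
context = alpha @ enc
W_c = nn.Linear(2 * H, H, bias=False)
h_tilde = torch.tanh(W_c(torch.cat([context, h_t])))     # attentional vector, feeds the softmax layer
```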
There are actually many differences besides the scoring and the local/global attention. Here is a brief summary:
- Bahdanau et al. use an extra function to derive hs_{t-1} from hs_t. I didn't see a good reason anywhere for why they do this, but a paper by Pascanu et al. offers a clue: maybe they are looking to make the RNN deeper. Luong, of course, uses hs_t directly.
- Bahdanau uses a bi-directional encoder and a uni-directional decoder, while Luong has both as uni-directional. Luong also recommends taking just the top-layer outputs; in general, their model is simpler.
- The more famous one: there is no dot product of hs_{t-1} (the decoder output) with the encoder states in Bahdanau's attention. Instead, they use separate weights for both and do an addition instead of a multiplication. This perplexed me for a long while, as multiplication is more intuitive, until I read somewhere that addition is less resource intensive... so there are trade-offs (both styles of score are sketched in the code right after this list).
- In Bahdanau, we have a choice of using more than one unit to determine w and u, the weights that are applied individually to the decoder hidden state at t-1 and the encoder hidden states. Having done that, we need to massage the tensor shape back, hence the need for a multiplication with another weight v. Determining v is a simple linear transformation and needs just one unit.
- Luong gives us local attention in addition to global attention. Local attention is a combination of soft and hard attention.
- Luong gives us many other ways to calculate the attention weights, most involving a dot product, hence the name "multiplicative". I think there were 4 such equations; we can pick and choose the one we want.
- There are some minor changes, like Luong concatenating the context and the decoder hidden state and using one weight matrix instead of 2 separate ones.
- Last and most important: Luong feeds the attentional vector to the next time step, as they believe that past attention history is important and helps predict better values (there is a small input-feeding sketch after the scoring one below).
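To make the scoring bullets above concrete, here is a rough sketch of the score functions. The names and sizes are my own placeholders, not code from either paper:

```python
import torch
import torch.nn as nn

H, src_len = 8, 5                                # toy sizes
h_s = torch.randn(src_len, H)                    # encoder hidden states
h_t = torch.randn(H)                             # decoder state (t-1 for Bahdanau, t for Luong)

# Bahdanau / additive: separate weights W and U, an addition, then v maps the result
# back to a single number per source position
W = nn.Linear(H, H, bias=False)
U = nn.Linear(H, H, bias=False)
v = nn.Linear(H, 1, bias=False)
score_additive = v(torch.tanh(W(h_t) + U(h_s))).squeeze(-1)          # [src_len]

# Luong / multiplicative variants: pick whichever score function you want
score_dot = h_s @ h_t                                                 # dot
W_a = nn.Linear(H, H, bias=False)
score_general = h_s @ W_a(h_t)                                        # general
W_cat = nn.Linear(2 * H, H, bias=False)
v_cat = nn.Linear(H, 1, bias=False)
score_concat = v_cat(torch.tanh(                                      # concat
    W_cat(torch.cat([h_t.expand(src_len, H), h_s], dim=1)))).squeeze(-1)

alpha = torch.softmax(score_dot, dim=0)          # any of the scores -> alignment weights
```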
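And the input-feeding idea from the last bullet, again as a rough sketch with made-up sizes:

```python
import torch
import torch.nn as nn

E, H = 6, 8                                    # toy embedding and hidden sizes
lstm = nn.LSTMCell(E + H, H)                   # input = word embedding + previous attentional vector
emb_t = torch.randn(E)                         # embedded target word at step t
h_tilde_prev = torch.zeros(H)                  # attentional vector from step t-1 (zeros at the first step)
h_prev, c_prev = torch.zeros(1, H), torch.zeros(1, H)

# Input feeding: the previous attentional vector is concatenated with the input,
# so the model is informed about past alignment decisions.
x = torch.cat([emb_t, h_tilde_prev]).unsqueeze(0)        # [1, E + H]
h_t, c_t = lstm(x, (h_prev, c_prev))
# attention is then computed from h_t as usual, producing h_tilde_t, which is used
# both for the prediction at step t and as extra input at step t+1
```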
The good news is that most of these are superficial changes. Attention as a concept is so powerful that any basic implementation suffices. There are 2 things that seem to matter, though: the passing of attentional vectors to the next time step and the concept of local attention (especially if resources are constrained). The rest don't influence the output in a big way.
For more specific details, please refer to https://towardsdatascience.com/create-your-own-custom-attention-layer-understand-all-flavours-2201b5e8be9e
Both mechanisms are explained very well in a PyTorch seq2seq tutorial.
The main difference is how the similarity between the current decoder input and the encoder outputs is scored.
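Roughly, writing $s$ for the decoder state and $h_i$ for the encoder outputs (my own paraphrase, not the tutorial's notation):

$$
\text{Bahdanau:}\quad e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i)
\qquad
\text{Luong (dot):}\quad e_{t,i} = s_t^\top h_i
$$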
Just to add to @shamane-siriwardhana's answer: the main difference is in the output of the decoder network.