Understanding the attention model in the assignment

While I have completed the assignment for week 1, I still have a few doubts about the way the attention model is designed.

  1. The embedding layers in the encoder and the pre-attention decoder have different weights (correct?). So the token embeddings in the decoder and those in the encoder are in different namespaces. When we calculate the similarity in the attention layer between Q and K coming from different namespaces, would the similarity carry any meaningful information?

  2. If I understand the lecture correctly, the hidden layers from the pre-attention decoder should go as input to the attention preparation layer. But in the assignment we are taking the outputs of the LSTM layer as input to the attention layer, not the hidden layer. Why is that, or am I misunderstanding something here?

Hi @Ritu_Pande

Yes, in this assignment the encoder and decoder have separate embedding layers.

I’m not sure what you mean by different “namespaces” (different Embedding layers?). But if I’ve guessed the essence of your question correctly, the answer lies in the prepare_attention_input function:

  • keys are encoder_activations (note, from the last LSTM layer, not the Embedding layer)
  • values are also encoder_activations
  • queries are decoder activations (again, from the last LSTM layer of the decoder, not the Embedding layer)
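
To make the three bullets above concrete, here is a minimal NumPy sketch of how queries and keys from the two LSTM stacks get compared. It is not the assignment's Trax code: the function name dot_product_attention, the shapes, and the random stand-in activations are assumptions for illustration only, and the padding mask is left out.

```python
import numpy as np

def dot_product_attention(queries, keys, values):
    """Scaled dot-product attention for one sentence pair (no padding mask)."""
    d = queries.shape[-1]
    # Similarity of every decoder position with every encoder position.
    scores = queries @ keys.T / np.sqrt(d)            # shape (n_dec, n_enc)
    # Numerically stable softmax over the encoder positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each decoder position receives a weighted mix of the encoder values.
    return weights @ values, weights

# Hypothetical stand-ins for the LSTM outputs (random here, learned in the model).
rng = np.random.default_rng(0)
encoder_activations = rng.normal(size=(5, 8))   # 5 source tokens, depth 8
decoder_activations = rng.normal(size=(3, 8))   # 3 target tokens, depth 8

keys = values = encoder_activations   # keys and values: encoder LSTM outputs
queries = decoder_activations         # queries: pre-attention decoder LSTM outputs

context, weights = dot_product_attention(queries, keys, values)
print(context.shape)   # (3, 8)
```

With random activations the attention weights carry no meaning. They only become informative once training has shaped both LSTM stacks, which is why separate Embedding layers are not a problem in themselves.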

The model should learn to align the queries and keys (and also the other weights) in such a way that it can correctly predict the German word. In other words, the gradient flows from top to bottom in this picture (the arrows would be reversed):

The layers that have weights are updated according to how well they contributed to predicting the target label.

I think that should also answer your second question.

Cheers

Thanks. I had misinterpreted the inputs to PrepareAttentionInput. It is clear to me now.
