Understanding the attention model in the assignment

Ritu_Pande · September 8, 2023, 5:35pm

While I have completed the assignment for week 1, I still have a few doubts regarding the way the attention model is designed.

The embedding layers in the encoder and the pre-attention decoder have different weights (correct?). So the embeddings of the tokens in the decoder and those in the encoder are in different namespaces. When we calculate the similarity in the attention layer between Q and K coming from different namespaces, would the similarity carry any meaningful information?
If I understand correctly from the lecture, the hidden layers from the pre-attention decoder should go as input to the attention preparation layer. But in the assignment we are taking the outputs of the LSTM layer as input to the attention layer, not the hidden layer. Why is that, or am I misunderstanding something here?

arvyzukai · September 8, 2023, 7:18pm

Hi @Ritu_Pande

In this assignment, yes, the encoder and decoder have separate embedding layers.

I’m not sure what you mean by different “namespaces” (different Embedding layers?). But if I’m guessing the essence of your question correctly, the answer should lie in the prepare_attention_input function:

keys are encoder_activations (note, from the last LSTM layer, not the Embedding layer)
values are also encoder_activations
queries are decoder activations (again, from the last LSTM layer of the decoder, not the Embedding layer)
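
To make the three lines above concrete, here is a minimal NumPy sketch of how I read that wiring. It is not the assignment's code: the function names `prepare_attention_input_sketch` and `dot_product_attention`, the padding id `0`, and all shapes are assumptions made up for illustration, and random arrays stand in for the LSTM outputs.

```python
import numpy as np

def prepare_attention_input_sketch(encoder_activations, decoder_activations,
                                   input_tokens, pad_id=0):
    """Hypothetical stand-in: route LSTM activations into Q, K, V plus a padding mask."""
    keys = encoder_activations       # from the encoder's last LSTM layer
    values = encoder_activations     # same tensor as the keys
    queries = decoder_activations    # from the pre-attention decoder's LSTM
    # True where the source token is real; shape (batch, 1, enc_len) so it
    # broadcasts over every decoder position.
    mask = (input_tokens != pad_id)[:, None, :]
    return queries, keys, values, mask

def dot_product_attention(queries, keys, values, mask):
    """Scaled dot-product attention with masked padding positions."""
    depth = queries.shape[-1]
    # Similarity between every decoder position and every encoder position.
    scores = queries @ keys.transpose(0, 2, 1) / np.sqrt(depth)
    scores = np.where(mask, scores, -1e9)  # padded keys get ~zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Assumed toy sizes; random values replace real LSTM outputs.
rng = np.random.default_rng(0)
batch, enc_len, dec_len, d_model = 2, 5, 4, 8
encoder_activations = rng.normal(size=(batch, enc_len, d_model))
decoder_activations = rng.normal(size=(batch, dec_len, d_model))
input_tokens = np.array([[3, 7, 2, 0, 0],
                         [4, 4, 9, 1, 0]])  # 0 is the assumed padding id

queries, keys, values, mask = prepare_attention_input_sketch(
    encoder_activations, decoder_activations, input_tokens)
context, weights = dot_product_attention(queries, keys, values, mask)
print(context.shape)   # one context vector per decoder position
print(weights.shape)   # one row of attention weights per decoder position
```

With random activations the attention weights are meaningless, which is the point of the original question: nothing forces Q and K to be comparable until training shapes the layers that produce them.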

The model should learn to align the queries and keys (and also the other weights) in such a way that it can correctly predict the German word. In other words, the gradient flows from top to bottom in the picture of the model (the arrows would be reversed).

The layers that have weights would be updated according to how well they contributed to predicting the target label.

I think that should also answer your second question.

Cheers

Ritu_Pande · September 8, 2023, 8:06pm

Thanks. I had misinterpreted the inputs to PrepareAttentionInput. It is clear to me now.
