Why do we need the pre-attention decoder?

I’m trying to understand the motivation behind why a pre-attention decoder is needed.

Can we not simply use the actual decoder as per the original design?

Hello @Mohammad_Atif_Khan ,

Attention models are a more powerful mechanism than plain decoders. When working with bigger inputs - long paragraphs, long sentences, or a larger amount of information to retrieve - plain decoders are not adequate, because the information carried by earlier tokens gets lost.

To solve this, attention models are used.

“Attention is a layer of calculations that let your model focus on the most important parts of the sequence for each step. Queries, values, and keys are representations of the encoder and decoder hidden states. And they’re used to retrieve information inside the attention layer by calculating similarity between the decoder queries and the encoder key value pairs. This is so flexible, it can even find matches between languages with very different grammatical structures or alphabets.”

Go through this video:

With regards,
Nilosree Sengupta

Hi @nilosreesengupta ,
Thanks, but my question is about the pre-attention decoder. Why do we need that? Does it serve any purpose other than teacher forcing?

Hi @Mohammad_Atif_Khan

The pre-attention decoder runs on the targets and creates activations that are used as queries in attention.

In this diagram, the pre-attention decoder will encode the word “Wie”. (Not important: I’m not sure why it is not called a pre-attention encoder, since it encodes the targets.)

So the input encoder encodes the inputs (“How are you today?”) into the “Keys” and “Values” for AttentionQKV, and the pre-attention decoder encodes “Wie” into the “Queries” for AttentionQKV. Given this information, AttentionQKV scores each of the encoder hidden states to know which one the decoder (the output decoder, another LSTM) should focus on to produce the next word.
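
If it helps, here is a minimal numpy sketch of that flow (made-up shapes and names, not the assignment’s trax code): the encoder activations act as keys/values, the pre-attention decoder activations act as queries, and the scores turn into a weighted sum of the values.

    # Minimal numpy sketch (made-up shapes, not the assignment's trax code).
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    d = 4                                    # toy hidden size
    keys = values = np.random.randn(5, d)    # input encoder states for "How are you today ?"
    queries = np.random.randn(3, d)          # pre-attention decoder states for "<s> Wie geht"

    scores = queries @ keys.T / np.sqrt(d)   # how well each target step matches each input word
    weights = softmax(scores, axis=-1)       # one distribution over input words per target step
    context = weights @ values               # context vectors passed on to the "real" decoder
    print(context.shape)                     # (3, 4): one context vector per target step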

I hope this clarifies things; if not, feel free to ask 🙂 Cheers

@arvyzukai , in the lectures it was mentioned that “The pass of the hidden states from the decoder to the Attention Mechanism could not be easy to implement. Instead, you will be using two decoders”. So it seems that the main reason we need the pre-attention decoder is that it is a simplification?

  • Does this mean that the pre-attention decoder architecture is not widely used? Only for learning setups?
  • What are the names of other implementations that might pass the decoder’s previous hidden state to the attention layer?

Secondly, my understanding is that instead of passing the decoder’s previous hidden state vector to the attention layer, we encode (as suggested by arvyzukai) the target shifted right, which I understand in this context as assuming that the decoder got the prediction right at every single step (teacher forcing). So are we trying to approximate the decoder’s previous state by encoding a vector from the target as if it came from the decoder? Is my interpretation correct, can it at least serve as an intuition, or is it totally off?

Thanks.

Hi @newboadki

I would not think of the pre-attention decoder as trying to approximate the decoder’s previous hidden state. Its goal is to encode the target “state” at every step in the sequence.

If the pre-attention decoder did not have an LSTM in it, the “queries” would have no context (for example, every “bank” would get the same representation regardless of whether “river” or “money” appears in the sequence). But since there is an LSTM layer after the embedding layer, each word is context aware (for example, each “bank” gets a different representation depending on the other words, such as “river” or “money”, or on whether it is at the start or end of the sentence, etc.).
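
To make the “context aware” point concrete, here is a toy PyTorch sketch (made-up vocabulary and sizes, not the assignment code): the same token always gets the same embedding, but a different LSTM activation depending on the words before it.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    vocab = {"<s>": 0, "river": 1, "money": 2, "bank": 3}
    emb = nn.Embedding(len(vocab), 8)
    lstm = nn.LSTM(input_size=8, hidden_size=8, batch_first=True)

    a = torch.tensor([[vocab["<s>"], vocab["river"], vocab["bank"]]])
    b = torch.tensor([[vocab["<s>"], vocab["money"], vocab["bank"]]])

    # Same word "bank" -> identical embedding vectors ...
    print(torch.equal(emb(a)[:, -1], emb(b)[:, -1]))    # True
    # ... but different LSTM activations, because the preceding words differ.
    out_a, _ = lstm(emb(a))
    out_b, _ = lstm(emb(b))
    print(torch.allclose(out_a[:, -1], out_b[:, -1]))   # False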

So the pre-attention decoder produces the “queries” for the attention layer so that it can “align” them with the encoder “keys”. In overly simple terms, the pre-attention decoder asks which word should be translated next.

In a silly hypothetical example, the pre-attention decoder at each step in a sequence creates questions like:

  • at step 1: “Hey, encoder, I’m a “<s>”. What do you have for me?”,
  • at step 2: “Hey, encoder, I’m a “<s> Wie”. What do you have for me?”,
  • at step 3: “Hey, encoder, I’m a “<s> Wie geht”. What do you have for me?”
  • etc.

In the same silly example, the input encoder at each step in a sequence prepares keys and values like:

  • at step 1: "If you are looking for 1st word “How” (“key”), here is the what it (first word) represents - [0.1, …, -0.2] (“value”)
  • at step 2: "If you are looking for 2nd word “are” (“key”), here is the what it represents - [0.5, …, 0.3] (“value”)
  • at step 3: "If you are looking for 3rd word “you” (“key”), here is the what it represents - [-1.5, …, 0.1] (“value”)
  • etc.

So later the attention mechanism works out which input word matches what the decoder is looking for and aggregates the corresponding values at every step. In the silly example, the output of the attention layer could be:

  • at step 1: [0.1, …, -0.21]
  • at step 2: [0.51, …, 0.29]
  • at step 3: [-1.49, …, 0.11]
  • etc.
    Note that in this silly example the words align exactly (first word with first word, etc.), but that need not be the case - the attention mechanism could, for example, have swapped step 2 with step 3. Also, the values are not exactly the same as the encoder’s, since attention aggregates some values from the other input tokens (see the small calculation below).
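
To put numbers on the “aggregates” part, here is a tiny made-up calculation for step 1 (2-dimensional values just to keep it short; the real vectors are of course much longer):

    import numpy as np

    values = np.array([[ 0.1, -0.2],    # "How"
                       [ 0.5,  0.3],    # "are"
                       [-1.5,  0.1]])   # "you"
    weights_step1 = np.array([0.96, 0.02, 0.02])   # softmax output: mostly "How", a bit of the rest
    print(weights_step1 @ values)                  # ~[0.08, -0.18], close to but not exactly [0.1, -0.2]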

So now, the “real” decoder (Step 7 in the picture) can generate the translation out of these context vectors.

I guess this is a long way of saying that the pre-attention decoder outputs/activations do not have to approximate (or even have the same shape as) the “real” decoder outputs, because their goal is to ask the right questions rather than to “be” the translated word (or, more precisely, to be the best values for the following Dense layer).

Cheers

Thank you @arvyzukai for your detailed answer. After your explanation, the role of the pre-attention decoder is clear.

The part that I am still confused about is that in the lectures it was said that in the original forms of attention the inputs to the attention layer would be:

  • The encoder’s hidden states (h1, h2, …hn)
  • The decoder’s previous hidden state vector (s_{i-1})

However, in the lecture ‘NMT Model with Attention’ , it was said:

Recall that the decoder is supposed to pass hidden states to the Attention Mechanism to get context vectors. The pass of the hidden states from the decoder to the Attention Mechanism could not be easy to implement. Instead, you will be using two decoders.

So in my mind, I had formed the idea that the pre-attention decoder must be filling the role of the decoder’s previous hidden state vector. But I now get that “its goal is to encode the target ‘state’ at every step in a sequence.”

What is this other solution mentioned in the lectures, the one that is hard to implement? And is the encoder plus pre-attention decoder the standard implementation of scaled dot-product attention for neural machine translation?

Hi @newboadki

The other solution (if I recall correctly) that is mentioned in the lectures is the original architecture from this paper (one of the most influential ideas, and one that gave rise to the “Attention Is All You Need” paper).
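
For reference, here is roughly how that original formulation (Bahdanau-style additive attention, assuming that is the paper in question) wires the decoder’s previous hidden state into attention, in the h_j / s_{i-1} notation from the lectures:

    e_{ij}      = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)             % alignment score of input word j for target step i
    \alpha_{ij} = \exp(e_{ij}) \big/ \textstyle\sum_k \exp(e_{ik})   % attention weights (softmax over the input words)
    c_i         = \textstyle\sum_j \alpha_{ij} h_j                   % context vector for target step i
    s_i         = f(s_{i-1}, y_{i-1}, c_i)                           % next decoder state

So there the queries come straight out of the decoder’s own recurrence, which is exactly the wiring the assignment sidesteps by giving the queries their own pre-attention decoder.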

I’m not sure why it was “not easy to implement”, but if I had to guess, it has something to do with the trax trainer class, because, in theory, the architecture without the pre-attention decoder is simpler (I had the chance to implement it in PyTorch while reading a book on the topic, and also in Excel 🙂).

Btw, the model is a bit more complex than the original because the tl.AttentionQKV layer (code) additionally uses three Dense layers inside the attention:

      Parallel_in3_out3[
        Dense_1024
        Dense_1024
        Dense_1024
      ]

In other words, the Assignment model additionally applies a linear transformation to the Q, K, and V shown in the diagram. (Never mind if I introduced more confusion 🙂)
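
In numpy terms it is roughly this (a sketch only, with biases and the other details of the real tl.AttentionQKV layer left out):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    d = 1024
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (0.01 * rng.standard_normal((d, d)) for _ in range(3))  # the three Dense_1024 weights (biases omitted)

    q = rng.standard_normal((3, d))      # pre-attention decoder activations (queries)
    k = v = rng.standard_normal((5, d))  # input encoder activations (keys = values)

    # Linear transforms first, then the usual scaled dot-product attention.
    q, k, v = q @ Wq, k @ Wk, v @ Wv
    context = softmax(q @ k.T / np.sqrt(d), axis=-1) @ v
    print(context.shape)                 # (3, 1024)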

This particular architecture (and also the one in the linked paper) is not used very often these days for NMT, because transformers have taken over this field.

Also, if you are interested in different attention mechanisms, I would recommend reading the Attention Mechanisms and Transformers chapter.

Cheers

Thank you for all the answers @arvyzukai , great as always!
