General Understanding of Transformer Encoder and Decoder blocks

Hello,

I just finished the C4_W3 assignment. Even though it was possible to complete the assignments with the given help, I still don't fully understand the difference between encoder and decoder blocks.

In the week 2 assignment we had this as the decoder block:
[image: decoder block]

In the week 3 assignment we had this encoder block:

If we ignore the input embeddings / positional encoding (which we also have on the decoder side), then it's the same, isn't it?

In convolutional networks it was always obvious to me which side is the encoder and which is the decoder (conv/maxpool → conv/upsampling). But here, at the moment, the only thing that tells me which one is the decoder is this last part:
[image: final Linear + Softmax layers]

I must have missed something, and I hope you can point out what. :slight_smile:

Thanks and Regards!

Hi @JonasK

Actually, they differ in two main ways:

In the architecture diagram, the encoder is on the left (lightly grey-shaded), and the decoder is on the right (lightly grey-shaded, longer).

  • First, the decoder uses Masked Multi-Head Attention (which means it can attend only to the current and previous tokens (words)) in its first attention block. In other words, the decoder only "queries" itself and past tokens to calculate attention scores.

  • Second, the encoder uses Self-Attention, which means its Q, K, and V are constructed from the same input, while the decoder uses Cross-Attention (in its second Multi-Head Attention block), which means its Q is constructed from one input (its own), while K and V are constructed from the other input (as you can see, the two arrows come from the encoder block). A minimal sketch of both differences follows below.

The Linear and Softmax parts are just for the output.
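To make those two differences concrete, here is a minimal NumPy sketch of single-head attention. This is not the course code: it omits the learned Q/K/V projection matrices and the multiple heads, and only illustrates where Q, K, V come from and where the mask is applied.

```python
import numpy as np

def attention(q, k, v, causal=False):
    """softmax(Q K^T / sqrt(d)) V for a single head, with an optional causal mask."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)           # (batch, len_q, len_k)
    if causal:
        len_q, len_k = scores.shape[-2], scores.shape[-1]
        future = np.triu(np.ones((len_q, len_k), dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)                 # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over the key dimension
    return weights @ v

batch, enc_len, dec_len, d = 2, 7, 5, 16
enc_x = np.random.randn(batch, enc_len, d)   # encoder-side representations
dec_x = np.random.randn(batch, dec_len, d)   # decoder-side representations

# Encoder: self-attention, Q = K = V come from the same (encoder) input, no mask.
enc_out = attention(enc_x, enc_x, enc_x)

# Decoder, first block: masked self-attention, Q = K = V come from the decoder input.
dec_self = attention(dec_x, dec_x, dec_x, causal=True)

# Decoder, second block: cross-attention, Q from the decoder, K and V from the encoder
# output (those are the two arrows coming from the encoder in the diagram).
dec_cross = attention(dec_self, enc_out, enc_out)

print(enc_out.shape, dec_self.shape, dec_cross.shape)   # (2, 7, 16) (2, 5, 16) (2, 5, 16)
```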

Cheers


Hey, thanks for the quick response. :slight_smile:
Yeah, I saw the difference in the attention used (point 1). The second point I'd read/heard in this course before, but somehow wasn't really aware of it. Now the picture is slowly coming together, thanks :slight_smile: I'll probably go over the 4 weeks of this course again in a month or so, and then it should hopefully stick hehe.

Hi, @arvyzukai

For question answering with context in particular, what is consumed by the encoder and the decoder, respectively?

My understanding is that during training, the encoder consumes question + context, and the decoder consumes question + context + full_answer. During inference, the encoder consumes the same as in training, and the decoder consumes question + context + answer_predicted_so_far. Is my understanding correct?

Your understanding seems to suggest that during training, the full answer is consumed by the decoder, which is usually not the case. The full answer is generally used as the target sequence for the decoder, not as its input. The input to the decoder is usually the answer-so-far during training.

Also, it’s not common to feed both the question and context into the decoder during inference. The question and context are usually consumed by the encoder, and the decoder starts with a minimal prompt (often just the question or a start token) and generates the answer incrementally based on the encoder’s output and the tokens generated so far.

Now, during inference, the decoder is seeded with just the question (and perhaps a start-of-sequence token). The decoder then starts generating the answer one token at a time. After each generation step, the newly generated token is appended to the input, and the decoder generates the next token based on this extended input.
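If it helps, here is a toy sketch of that data flow. The "model" below is just a stand-in that returns random logits, and names like START, EOS, and MAX_LEN are illustrative, not from the course code:

```python
import numpy as np

VOCAB, START, EOS, MAX_LEN = 100, 1, 2, 10   # illustrative constants
rng = np.random.default_rng(0)

def model(encoder_ids, decoder_ids):
    """Stand-in for a trained seq2seq model: one logit vector per decoder position."""
    return rng.standard_normal((len(decoder_ids), VOCAB))

encoder_ids = [5, 6, 7, 8, 9]       # tokenized question + context (encoder input)
answer_ids  = [10, 11, 12, EOS]     # tokenized full answer (used as training labels)

# Training (teacher forcing): the decoder input is the answer shifted right;
# the full answer is only used as the target for the cross-entropy loss.
decoder_in = [START] + answer_ids[:-1]
train_logits = model(encoder_ids, decoder_in)    # shape (len(answer_ids), VOCAB)

# Inference: the decoder is seeded with START and extends its own output, one token at a time.
decoder_in = [START]
while decoder_in[-1] != EOS and len(decoder_in) < MAX_LEN:
    logits = model(encoder_ids, decoder_in)
    decoder_in.append(int(logits[-1].argmax()))  # greedy choice of the next token

print("generated token ids:", decoder_in[1:])
```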


So does this mean that the encoder has the entire sequence "in mind", while the decoder has only the current and previous translated words?
I get confused by the parallel computations done in the multi-head attention process; if I understand things correctly, they would suggest that the encoder already has a full representation of the original input sequence.
Am I correct? Thank you in advance.

Hi @Nanini

Yes, but not only that; I think you are missing the distinction between training time and inference time (the actual use of the model).

During training, the encoder gets the entire context input (the whole sequence from which the targets have to be predicted). Its job is to represent that sequence as well as it can. This representation is the arrow that comes from the left into the decoder; it has shape (batch_size, sequence_length, output_dim), and the same values are fed to each of the Nx decoder blocks.

During training, the decoder gets the entire target input (the whole shifted-right sequence it has to predict), but it uses a causal mask in its first Multi-Head Attention block, which is what enables teacher forcing and the parallel computation. In other words, when predicting the second word, the decoder does not have to wait for its prediction of the first word - it can just use the "true" value as if it had predicted it itself, because of the causal mask.

But during inference (when the trained model is actually used), the encoder still gets the entire context input (same as in training), while the decoder now has to receive its own input (its own predictions) - one token at a time.
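To see why the parallel computation works, here is a tiny, purely illustrative sketch of the causal mask itself: position i of the target can only attend to positions ≤ i, so all positions can be computed in one pass without any of them peeking ahead.

```python
import numpy as np

seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))  # 1 = allowed, 0 = blocked
print(causal_mask)
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```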

Does that make sense?


Yes, great, thank you!