General Understanding of Transformer Encoder and Decoder blocks

Hello,

I just finished the C4_W3 assignment. Even though it was possible to complete the assignments with the given help, I still don't fully understand the difference between encoder and decoder blocks.

In the week 2 assignment we had this as the decoder block:
[image: decoder block]

In the week 3 assignment we had this encoder block:

If we ignore the input embeddings / positional encoding (which we also have on the decoder side), then it's the same, isn't it?

In convolutional networks it was always obvious to me which side is the encoder and which is the decoder (conv/maxpool → conv/upsampling). But here, at the moment, the only thing that tells me which one is the decoder is this last part:
[image: final Linear + Softmax layers]

I must have missed something, and I hope you can point out what. :slight_smile:

Thanks and Regards!

Hi @JonasK

Actually, they differ in two main ways:

In the architecture diagram, the encoder is on the left (lightly grey-shaded), and the decoder is on the right (lightly grey-shaded, longer).

  • First, the decoder uses Masked Multi-Head Attention (which means it can attend only to the current and previous tokens (words)) in its first attention block. In other words, the decoder only "queries" itself and past tokens to calculate attention scores.

  • Second, the encoder uses Self-Attention, which means its Q, K, and V are constructed from the same input, while the decoder uses Cross-Attention (in its second Multi-Head Attention block), which means its Q is constructed from one input (its own), while K and V are constructed from the other input (as you can see, the two arrows come from the encoder block). A minimal sketch of both differences follows below.

The Linear and Softmax parts are just for the output.
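To make those two differences concrete, here is a minimal NumPy sketch of single-head attention. This is not the course code: it omits the learned Q/K/V projection matrices and the multiple heads, and only illustrates where Q, K, V come from and where the mask is applied.

```python
import numpy as np

def attention(q, k, v, causal=False):
    """softmax(Q K^T / sqrt(d)) V for a single head, with an optional causal mask."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)           # (batch, len_q, len_k)
    if causal:
        len_q, len_k = scores.shape[-2], scores.shape[-1]
        future = np.triu(np.ones((len_q, len_k), dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)                 # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over the key dimension
    return weights @ v

batch, enc_len, dec_len, d = 2, 7, 5, 16
enc_x = np.random.randn(batch, enc_len, d)   # encoder-side representations
dec_x = np.random.randn(batch, dec_len, d)   # decoder-side representations

# Encoder: self-attention, Q = K = V come from the same (encoder) input, no mask.
enc_out = attention(enc_x, enc_x, enc_x)

# Decoder, first block: masked self-attention, Q = K = V come from the decoder input.
dec_self = attention(dec_x, dec_x, dec_x, causal=True)

# Decoder, second block: cross-attention, Q from the decoder, K and V from the encoder
# output (those are the two arrows coming from the encoder in the diagram).
dec_cross = attention(dec_self, enc_out, enc_out)

print(enc_out.shape, dec_self.shape, dec_cross.shape)   # (2, 7, 16) (2, 5, 16) (2, 5, 16)
```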

Cheers


Hey, thanks for the quick response. :slight_smile:
Yeah, I saw the difference in the attention used (point 1). The second point I'd read/heard in this course before, but somehow wasn't really aware of it. Now the picture is slowly coming together, thanks :slight_smile: I'll probably go over the 4 weeks of this course again in a month or so, and then it should hopefully stick hehe.

Hi, @arvyzukai

For question answering with context in particular, what is consumed by the encoder and the decoder, respectively?

My understanding is that during training, the encoder consumes question + context, and the decoder consumes question + context + full_answer. During inference, the encoder consumes the same as in training, and the decoder consumes question + context + answer_predicted_so_far. Is my understanding correct?

Your understanding seems to suggest that during training, the full answer is consumed by the decoder, which is usually not the case. The full answer is generally used as the target sequence for the decoder, not as its input. The input to the decoder is usually the answer-so-far during training.

Also, it’s not common to feed both the question and context into the decoder during inference. The question and context are usually consumed by the encoder, and the decoder starts with a minimal prompt (often just the question or a start token) and generates the answer incrementally based on the encoder’s output and the tokens generated so far.

Now, during inference, the decoder is seeded with just the question (and perhaps a start-of-sequence token). The decoder then starts generating the answer one token at a time. After each generation step, the newly generated token is appended to the input, and the decoder generates the next token based on this extended input.
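If it helps, here is a toy sketch of that data flow. The "model" below is just a stand-in that returns random logits, and names like START, EOS, and MAX_LEN are illustrative, not from the course code:

```python
import numpy as np

VOCAB, START, EOS, MAX_LEN = 100, 1, 2, 10   # illustrative constants
rng = np.random.default_rng(0)

def model(encoder_ids, decoder_ids):
    """Stand-in for a trained seq2seq model: one logit vector per decoder position."""
    return rng.standard_normal((len(decoder_ids), VOCAB))

encoder_ids = [5, 6, 7, 8, 9]       # tokenized question + context (encoder input)
answer_ids  = [10, 11, 12, EOS]     # tokenized full answer (used as training labels)

# Training (teacher forcing): the decoder input is the answer shifted right;
# the full answer is only used as the target for the cross-entropy loss.
decoder_in = [START] + answer_ids[:-1]
train_logits = model(encoder_ids, decoder_in)    # shape (len(answer_ids), VOCAB)

# Inference: the decoder is seeded with START and extends its own output, one token at a time.
decoder_in = [START]
while decoder_in[-1] != EOS and len(decoder_in) < MAX_LEN:
    logits = model(encoder_ids, decoder_in)
    decoder_in.append(int(logits[-1].argmax()))  # greedy choice of the next token

print("generated token ids:", decoder_in[1:])
```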


So does this mean that the encoder has the entire sequence "in mind", while the decoder has only the current and previous translated words?
I get confused by the parallel computations done in the multi-head attention process; if I understand things correctly, they would suggest that the encoder already has a full representation of the original input sequence.
Am I correct? Thank you in advance.

Hi @Nanini

Yes, but not only that; I think you are missing the distinction between training time and inference time (the actual use of the model).

During training, the encoder gets the entire context input (the whole sequence from which the targets have to be predicted). Its job is to represent that sequence as well as it can. This representation is the arrow that comes from the left into the decoder; it has shape (batch_size, sequence_length, output_dim), and the same values are fed to each of the Nx decoder blocks.

During training, the decoder gets the entire target input (the whole shifted-right sequence it has to predict), but it uses a causal mask in its first Multi-Head Attention block, which is what enables teacher forcing and the parallel computation. In other words, when predicting the second word, the decoder does not have to wait for its prediction of the first word - it can just use the "true" value as if it had predicted it itself, because of the causal mask.

But during inference (when the trained model is actually used), the encoder still gets the entire context input (same as in training), while the decoder now has to receive its own input (its own predictions) - one token at a time.
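To see why the parallel computation works, here is a tiny, purely illustrative sketch of the causal mask itself: position i of the target can only attend to positions ≤ i, so all positions can be computed in one pass without any of them peeking ahead.

```python
import numpy as np

seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))  # 1 = allowed, 0 = blocked
print(causal_mask)
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```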

Does that make sense?


Yes, great, thank you!