Transformer decoder architecture in course 2

Hi, I have a question concerning the Transformer Decoder architecture introduced in the course lecture :blush:


This architecture is not the same as the GPT-2 decoder, since it doesn’t include the masked multi-head attention. In fact, it is exactly the same as the encoder architecture (BERT).
You can find the actual GPT-2 decoder architecture at this link: https://www.researchgate.net/figure/GPT-2-model-architecture-The-GPT-2-model-contains-N-Transformer-decoder-blocks-as-shown_fig1_373352176

Could you please clarify the architecture introduced in the course?

Hi!
I can confirm that this is explained incorrectly, as masking is an important part of the decoder block. What surprises me is that not only is the diagram incorrect, but the instructor has also skipped the masking step in their video. I will raise this with the course coordinator.
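For reference, here is a minimal sketch (not the course code) of a GPT-2-style decoder block, showing where the masked self-attention sits. It assumes a recent TensorFlow version where `MultiHeadAttention` supports `use_causal_mask`; the layer sizes are made up for illustration.

```python
import tensorflow as tf

class DecoderBlock(tf.keras.layers.Layer):
    """GPT-2-style decoder block: masked (causal) self-attention followed by a
    feed-forward network, each with a residual connection and layer norm."""

    def __init__(self, d_model=256, num_heads=4, d_ff=1024, **kwargs):
        super().__init__(**kwargs)
        self.attn = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="gelu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, x):
        # use_causal_mask=True is the "masked" part: position i can only
        # attend to positions 0..i (the look-ahead / causal mask).
        attn_out = self.attn(query=x, value=x, key=x, use_causal_mask=True)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))
```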

Thanks for catching this!

Thanks for the clarification.

Hi @Ibrahim_RIDENE

Admittedly, the infographics in this lecture do not meet the highest standards, and the lecture discusses a different architecture than the week’s assignment in the first place (the assignment uses an encoder-decoder model instead of a decoder-only one).

But to be fair, it does mention the masking implicitly:

Note the “looks at previous words”

Cheers


Hi @arvyzukai, thanks for the clarification!
By the way, I have a question concerning the Transformer summarizer video of the course, since it discusses a way to build a summarizer based only on the decoder. The instructor mentioned that the input to the model must be:
[Article]< eos >[Summary]< eos >
In this case, what is the expected output? He mentioned that it is the same as the input, but then we cannot predict the summary recursively during testing, because we will only feed the article as input!
As we saw for seq2seq and the Transformer, if we need to predict words one by one, we need a one-token gap between the input and the output, e.g. input: < sos >[…], output: […]< eos >. So the input during training must be “[Article]< eos >[Summary]” and not “[Article]< eos >[Summary]< eos >”.

Hi @Ibrahim_RIDENE

For context: the course video material was adapted from the previous trax-framework decoder-only implementation to the current TensorFlow encoder-decoder implementation.

To answer your question:

Yes, I believe your understanding is correct, and it is also in line with the course material:

For training you would split at the [Summary] part. The “[Article]< eos >” would be the input and the [Summary] part would be the target (up to the < eos > if the model generates it early, or up to the maximum summary length if < eos > never comes up).
At inference time you would feed “[Article]< eos >[Summary up to the last predicted token]” recursively.
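As a toy illustration (plain Python lists of tokens, not the course code), the training pair and the inference prompt in this decoder-only setup could look like this; the tokenization and token names are made up for the example:

```python
EOS = "<eos>"

article = ["the", "cat", "jumped", "on", "the", "table", "."]
summary = ["jumpy", "cat", "."]

# Training: one long sequence, shifted by one position, so that the token
# at position i is the target for position i-1 (teacher forcing).
full = article + [EOS] + summary + [EOS]
train_input  = full[:-1]   # [Article] <eos> [Summary]
train_target = full[1:]    # the same sequence shifted by one token;
                           # the loss is typically computed only on the summary positions

# Inference: only the article is available; the summary tokens are appended
# one by one from the model's own predictions (see the generation loop further below).
prompt = article + [EOS]
```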

Does that make sense?

Hi @arvyzukai ,
If my understanding is correct, we will feed only the “[Article]< eos >” as input to the model, and the “[Summary]< eos >” will serve only to calculate the loss based on the predicted sentence.
In that case, I think we will face a problem during evaluation, as the model is trained with only the article as input (not the article plus the summary up to the last predicted token)?

Hi @Ibrahim_RIDENE

Yes, this is correct for the previous design.

No, not really. Evaluation is more about overfitting. In other words, during training the model could “memorize” the training set (if the model is big enough) more and more, and would generate exactly the same summaries (because it “saw” all the possible inputs/documents and desired outputs/summaries many times).
But if it was never trained on the evaluation set (as should be the case), then we can check how well it summarizes documents that it “never saw”.

I hope that makes sense.
Cheers

By “facing a problem during evaluation” I mean that we cannot do the recursive procedure (during evaluation), since with only an article as input the model will directly predict the whole sentence (not one word at a time, with each prediction serving as the next input).

Hi @Ibrahim_RIDENE

That is not true. As I said before, the model behaves essentially the same during training and evaluation; the only difference is the data (and some speed-ups, like not calculating the gradients).

The causal mask (also called the look-ahead mask) is the trick that allows Transformer decoders to train (and evaluate) on multiple tokens in parallel. In simple terms, we distribute many samples of “[Article]< eos >[n summary tokens]” across multiple GPU cores to predict the (n+1)-th token, and the process is essentially the same for training and evaluation.
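For the curious, this is what a look-ahead (causal) mask looks like; a small TensorFlow sketch (not taken from the course notebook), where row i marks the positions token i is allowed to attend to:

```python
import tensorflow as tf

seq_len = 5
# Lower-triangular matrix: 1 = visible to that position, 0 = masked out.
mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
print(mask.numpy())
# [[1. 0. 0. 0. 0.]
#  [1. 1. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]]
```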

But during inference (a word for when we “actually use” the model after training), we don’t have the “real” summary, so we recursively feed the model its own predictions to predict the (n+1)-th token.

Cheers

Hi @arvyzukai ,

Sorry, but I am really lost! You said before that during training we only feed the article as input (not the article and the summary).

But here you are saying that we feed the model the article and the summary up to the n-th token.

Could you please clarify which one is correct?
Basically, this is my initial misunderstanding, as I didn’t understand what exactly the input to the model is.

Thanks

Hi @Ibrahim_RIDENE

Actually, they are both correct. ML terminology can be confusing, so let me explain.

  • we define input as “[Article]< eos >”
  • we define target as “[Summary]”

But since our target is a sequence (not a single token; the tokens depend on each other, and the model predicts them all at once while we care about only one position at a time), we use what is called “teacher forcing”. It means that our input can be interpreted as:

  • input_gpu1 = “[Article]< eos >”, target_gpu1 = “word1”;
  • input_gpu2 = “[Article]< eos >[Summary word1]”, target_gpu2 = “word2”;
  • input_gpu3 = “[Article]< eos >[Summary word1 word2]”, target_gpu3 = “word3”;

These different inputs (with up to the n-th token of the summary included) are not split manually; they are “simulated” with a causal mask in the attention: the model can only “see” (“pay attention to”) tokens up to the n-th position when making the prediction for the (n+1)-th token.
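A toy illustration of the same idea (plain Python, with the whole article treated as a single “token” for brevity; not the course code): one forward pass under the causal mask produces exactly the per-position views listed above.

```python
sequence = ["[Article]", "<eos>", "word1", "word2", "word3"]

for n in range(1, len(sequence) - 1):
    visible = sequence[: n + 1]   # what the causal mask exposes to this position
    target  = sequence[n + 1]     # the token that this position must predict
    print(f"sees {visible} -> predicts {target!r}")

# sees ['[Article]', '<eos>'] -> predicts 'word1'
# sees ['[Article]', '<eos>', 'word1'] -> predicts 'word2'
# sees ['[Article]', '<eos>', 'word1', 'word2'] -> predicts 'word3'
```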


And, as I said, it is the same procedure for training and evaluation. The difference is that we evaluate how the model predicts on data that it never changed its weights on. For a concrete example, the training data set could be:

  • [“The cat jumped on the table.” < eos > “Jumpy cat.”]
  • [“The dog sleeps on a mat.” < eos > “Lazy dog.”]

In other words, we change the model weights according to these inputs/targets. But we also want to know/evaluate how well the model would predict the summary of:

  • [“The monkey sits in a tree.” < eos > “Happy monkey.”]

In other words, we don’t change the model’s weights on this combination; we just want to “see” how the training is improving.

And finally, after all the training and evaluation is done, we deploy the model for “real” use (production/inference). Now there are no summaries as targets, so the input could be:

  • [“The manager sits in the office.” < eos >]

In this case, whatever the model predicts as the first token of the summary, the input for predicting the second token would be the original input plus the model’s prediction for token 1. Now we “manually” append the predicted tokens to the input for further predictions.
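A sketch of that recursive generation loop (greedy decoding); `model` here is a hypothetical stand-in that returns one row of logits per position, not a function from the course notebook:

```python
def generate_summary(model, article_tokens, eos_id, max_len=50):
    tokens = list(article_tokens) + [eos_id]      # the "[Article] <eos>" prompt
    for _ in range(max_len):
        logits = model(tokens)                    # one prediction per position
        next_token = int(logits[-1].argmax())     # greedy pick at the last position
        if next_token == eos_id:                  # the model decided the summary is done
            break
        tokens.append(next_token)                 # feed the prediction back in
    return tokens[len(article_tokens) + 1:]       # return only the generated summary
```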

I hope that clarifies the confusion.
Cheers

P.S. Reminder: this explanation is about the previous (trax) assignment’s design (decoder-only). The current (TensorFlow) assignment uses a different design (encoder-decoder).