Hi @Ibrahim_RIDENE
Actually, they are both correct. ML terminology can be confusing, so let me explain.
- we define input as “[Article]< eos >”
- we define target as “[Summary]”
But since our target is a sequence (not a single token; its tokens depend on each other, and the model predicts them all at once while we score only one position at a time), we use what is called “teacher forcing”. It means that our input can be interpreted as:
- input_gpu1 = “[Article]< eos >”, target_gpu1 = “word1”;
- input_gpu2 = “[Article]< eos >[Summary word1]”, target_gpu2 = “word2”;
- input_gpu3 = “[Article]< eos >[Summary word1 word2]”, target_gpu3 = “word3”;
…
These different inputs (each including the summary up to the n-th token) are not split out manually; they are “simulated” with a causal mask for Attention: the model can only “see” (“pay attention to”) tokens up to the n-th position when making a prediction for the (n+1)-th token.
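To make this concrete, here is a minimal NumPy sketch (toy tokens and illustrative names only, not the assignment’s code) showing how a lower-triangular causal mask gives every position its own “simulated” input, with the target simply being the next token:

```python
import numpy as np

# Toy sequence: article tokens, then <eos>, then the summary (teacher forcing).
tokens = ["The", "cat", "jumped", "<eos>", "Jumpy", "cat."]
n = len(tokens)

# Causal mask: position i may only attend to positions j <= i.
mask = np.tril(np.ones((n, n), dtype=bool))

# Each row of the mask is one of the "simulated" inputs from the list above;
# the target at position i is just the token at position i + 1.
for i in range(n - 1):
    visible = [tokens[j] for j in range(n) if mask[i, j]]
    print(f"position {i}: sees {visible} -> target: {tokens[i + 1]!r}")
```

All n predictions happen in a single forward pass; the mask is what prevents any position from “peeking” at later tokens.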
And, as I said, it is the same procedure for training and evaluation. The difference is that during evaluation we check how the model predicts on data it has never updated its weights on. As a concrete example, the training data set could be:
- [“The cat jumped on the table.” < eos > “Jumpy cat.”]
- [“The dog sleeps on a mat.” < eos > “Lazy dog.”]
In other words, we change the model’s weights according to these inputs/targets. But we also want to know/evaluate how well the model would predict the summary of:
- [“The monkey sits in a tree.” < eos > “Happy monkey.”]
In other words, we don’t change the model’s weights on this combination; we just want to “see” how the training is improving (see the sketch below).
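Here is a small Python sketch of that idea (teacher_forced_loss is a made-up placeholder, not the assignment’s function; a real version would run the model and compute cross-entropy). Training and evaluation use the exact same teacher-forced forward pass; only training takes the optimizer step:

```python
def teacher_forced_loss(article, summary):
    """Placeholder for a teacher-forced forward pass; returns a dummy loss.
    In the real assignment this would run the model and compute cross-entropy."""
    return 0.01 * len(article + summary)  # illustrative number only

train_data = [("The cat jumped on the table.<eos>", "Jumpy cat."),
              ("The dog sleeps on a mat.<eos>", "Lazy dog.")]
eval_data = [("The monkey sits in a tree.<eos>", "Happy monkey.")]

for article, summary in train_data:
    loss = teacher_forced_loss(article, summary)
    # ...backprop + optimizer step would go here: weights DO change on train data.

for article, summary in eval_data:
    loss = teacher_forced_loss(article, summary)
    print(f"eval loss: {loss:.3f}")  # same forward pass, but NO weight update
```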
And finally, after all the training and evaluation are done, we deploy the model for “real” (production/inference). Now there are no summaries as targets, so the input could be:
- [“The manager sits in the office.” < eos >]
In this case, whatever the model predicts as the first token of the summary, the input for predicting the second token becomes the original input + the model’s prediction for token1. Now we “manually” append each predicted token to the input for further predictions, as sketched below.
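A minimal greedy-decoding sketch in Python (next_token is a stand-in for the real model, and the canned summary is just for illustration):

```python
def next_token(tokens):
    # Stand-in for a real model: replays a canned summary for illustration.
    canned = ["Busy", "manager.", "<eos>"]
    n_generated = len(tokens) - tokens.index("<eos>") - 1
    return canned[n_generated]

tokens = ["The", "manager", "sits", "in", "the", "office.", "<eos>"]
while True:
    tok = next_token(tokens)   # predict the next token from everything so far
    if tok == "<eos>":         # stop once the model emits end-of-sequence
        break
    tokens.append(tok)         # "manually" feed the prediction back in
print(" ".join(tokens))
```

Note the contrast with training: there the true summary is already in the input (teacher forcing), while here each token must be generated and fed back one step at a time.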
I hope that clarifies the confusion.
Cheers
P.S. Reminder: this explanation is about the previous course’s (trax) Assignment design (decoder-only). The current course’s (tensorflow) Assignment uses a different design (encoder-decoder).