Comparing the models for W2 and W3

Hi @PZ2004

That is correct, this assignment uses the decoder-only approach for summarization.

That is false. BERT is encoder-only.

Usually, it comes down to the performance you achieve. You can try both approaches and see which works best for you (computation-wise, accuracy, etc.).

Usually, summarization is done with encoder-decoder transformers (unlike the one in the assignment). In that case (encoder-decoder), the encoder takes the text as input, and the decoder outputs only the summary.
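For example, the data for an encoder-decoder summarizer could be laid out roughly like this (a minimal sketch with made-up token IDs and start/end tokens, not the course's code):

```python
article_tokens = [12, 847, 33, 9, 501]   # full document -> encoder input
summary_tokens = [77, 102, 5]            # summary -> decoder side

encoder_input  = article_tokens
decoder_input  = [0] + summary_tokens    # shifted right; 0 = hypothetical <start>
decoder_target = summary_tokens + [1]    # 1 = hypothetical <eos>
```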

In C4W1, we were presented with encoder-decoder translation, where English text was the input to the encoder and German text was the target (the decoder’s job). Similarly, we could have used the same architecture here: feed the text/document into the encoder and ask the decoder to output the summary. But, I guess, the course creators for this (next, C4W2) week wanted to introduce the decoder-only architecture (like in GPT) and the way to implement summarization with it.

In the C4W2 Assignment, a special token is used to separate the text from the summary, and a loss mask is used so the model is not penalized for its predictions on the text part, right or wrong; only the summary part contributes to the loss. As I mentioned, this is a decoder-only model.
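In code, that masking could look something like this (a minimal sketch with made-up token IDs and a hypothetical `SEP` id, not the assignment's exact preprocessing):

```python
import numpy as np

SEP = 2                                  # hypothetical separator token id
text_tokens    = [12, 847, 33, 9, 501]   # made-up ids for the article
summary_tokens = [77, 102, 5]            # made-up ids for the summary

# One sequence: article, separator, summary.
inputs = np.array(text_tokens + [SEP] + summary_tokens)

# 0 weight on the article part and the separator, 1 on the summary part.
weights = np.array([0] * (len(text_tokens) + 1) + [1] * len(summary_tokens))

# Pretend per-token cross-entropy coming out of the model:
per_token_loss = np.random.rand(len(inputs))

# Only the summary tokens contribute to the loss:
masked_loss = (per_token_loss * weights).sum() / weights.sum()
print(masked_loss)
```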

In the C4W3 Assignment we implement an encoder-only model (actually, one part of T5, the unsupervised denoising objective; T5 itself is an encoder-decoder model). So, in the Assignment, the model is trained to predict only the sentinel tokens. You can find out more about T5 in the paper, or maybe more concretely here.
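To make the sentinel idea concrete, here is a toy sketch of T5-style span corruption (the `<Z0>`, `<Z1>`, ... names and the `span_corrupt` helper are made up for illustration; the word example follows the one in the T5 paper, but this is not the assignment's code):

```python
def span_corrupt(words, masked_positions):
    """Toy T5-style span corruption: consecutive masked words collapse
    into one sentinel in the input, and the target holds the sentinels
    plus the words they hide."""
    inputs, targets = [], []
    sentinel, prev_masked = 0, False
    for i, w in enumerate(words):
        if i in masked_positions:
            if not prev_masked:             # open a new sentinel span
                tok = f"<Z{sentinel}>"
                inputs.append(tok)
                targets.append(tok)
                sentinel += 1
            targets.append(w)               # hidden word goes to the target
            prev_masked = True
        else:
            inputs.append(w)
            prev_masked = False
    targets.append(f"<Z{sentinel}>")        # final (closing) sentinel
    return " ".join(inputs), " ".join(targets)

words = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(words, masked_positions={2, 3, 8})
print(inp)   # Thank you <Z0> me to your party <Z1> week
print(tgt)   # <Z0> for inviting <Z1> last <Z2>
```

The model sees the corrupted input and is trained to reproduce only the target, i.e. the sentinels and the spans they stand for, rather than the whole text.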

Cheers