Comparing the models for W2 and W3

PZ2004 · November 13, 2023, 12:28am

It seems the W2 summarization model works only with the decoder and there is no encoder part of the model? And the W3 BERT model has both the encoder and decoder included, am I correct?

What determines whether an encoder is needed or not? Also it is not clear in the W3 model how the encoder will feed into the decoder to guide the cross attention. Can you help explain?

W2 model

W3 model

arvyzukai · November 16, 2023, 11:00am

Hi @PZ2004

That is correct, this assignment uses the decoder-only approach for summarization.

That is false. BERT is an encoder-only model.

Usually, the deciding factor is the performance you achieve. You can try both approaches and see what is best for you (computation-wise, accuracy, etc.).

Usually, summarization is done with an encoder-decoder transformer (unlike the one in the assignment). In that case the encoder takes the text as input, and the decoder outputs only the summary part.

In C4W1, we were presented with encoder-decoder translation, where English text was the input for the encoder and German text was the target (the decoder’s job). Similarly, we could have used the same architecture to input the text/document into the encoder and ask the decoder to output the summary. But, I guess, the course creators wanted to use this (next, C4W2) week to introduce the decoder-only architecture (as in GPT) and the way to implement summarization with it.
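On the question of how the encoder feeds into the decoder: in a standard encoder-decoder transformer, the link is cross-attention, where the queries come from the decoder and the keys and values come from the encoder outputs. Below is a minimal NumPy sketch of that idea. It is not the course's code; the function name, shapes, and random inputs are assumptions for illustration, and the learned query/key/value projections are left out to keep it short.

```python
import numpy as np

def cross_attention(decoder_states, encoder_outputs):
    """Scaled dot-product attention with decoder queries and encoder keys/values.

    Hypothetical helper: the learned projection matrices are omitted.
    """
    d = decoder_states.shape[-1]
    # One row of scores per decoder position, one column per encoder position.
    scores = decoder_states @ encoder_outputs.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each decoder position receives a weighted mix of encoder outputs.
    return weights @ encoder_outputs, weights

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(6, 8))  # 6 source tokens, model dim 8
decoder_states = rng.normal(size=(3, 8))   # 3 target tokens generated so far
context, weights = cross_attention(decoder_states, encoder_outputs)
print(context.shape, weights.shape)
```

Every decoder position can attend to every source token, so no causal mask is applied in this step.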

In the C4W2 Assignment, a special token is used to separate the text from the summary. A mask is also used so that the model is not penalized for its predictions on the text part; only the summary part counts toward the loss. As I mentioned, this is a decoder-only model.
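To make the masking idea concrete, here is a small sketch in plain NumPy. It is not the assignment's code: the token ids, the separator id, and the helper name are made up, and whether the separator itself gets a weight of 0 or 1 is an assumption here.

```python
import numpy as np

# Hypothetical token ids, chosen only for illustration.
SEP_ID = 0                    # assumed separator between text and summary
article = [11, 12, 13, 14]
summary = [21, 22]

# Decoder-only setup: text, separator, and summary form one sequence.
tokens = np.array(article + [SEP_ID] + summary)

# Loss weights: 0 over the text and separator, 1 over the summary.
loss_weights = np.array([0] * (len(article) + 1) + [1] * len(summary))

def masked_cross_entropy(log_probs, targets, weights):
    """Average negative log-likelihood over positions whose weight is 1."""
    # log_probs: (seq_len, vocab_size); targets and weights: (seq_len,)
    picked = log_probs[np.arange(len(targets)), targets]
    return -(picked * weights).sum() / weights.sum()

# Dummy uniform predictions over a 32-token vocabulary. In real training
# the model input would be the target sequence shifted right by one.
vocab_size = 32
log_probs = np.full((len(tokens), vocab_size), -np.log(vocab_size))
loss = masked_cross_entropy(log_probs, tokens, loss_weights)
print(loss_weights, loss)
```

Positions with weight 0 contribute nothing to the loss, so the model is only pushed to get the summary tokens right.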

In the C4W3 Assignment we implement an encoder-only model (actually one part, the unsupervised denoising part, of T5, which is itself an encoder-decoder model). So, in the Assignment the model is trained to predict only sentinels. You can find out more about T5 in the paper or maybe more concretely here.
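For reference, the denoising objective described in the T5 paper replaces spans of the input with sentinel tokens and builds a target from the sentinels followed by the dropped spans. The sketch below illustrates that input/target construction in plain Python. The function name, the `<extra_id_N>` sentinel strings, and the hand-picked spans are assumptions for illustration (in practice spans are sampled at random), and the assignment's own preprocessing may differ.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (inputs, targets).

    Hypothetical helper: spans must be sorted and non-overlapping.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # span is replaced in the input
        targets.append(sentinel)             # target names the sentinel...
        targets.extend(tokens[start:end])    # ...then lists the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```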

Cheers

PZ2004 · November 17, 2023, 2:20pm

In C4W3 we only implemented the encoder part of the model. However, the assignment appears to be truncated, i.e. there is no actual model training and prediction, so it was not clear whether we wrote the entire model with the encoder, or only a partial model. Since it was said that BERT is bi-directional, how can we tell that from the structure of the encoder? Is it that the attention isn’t masked, so the model can see both before and after the sentinels?

arvyzukai · November 18, 2023, 7:49am

That is true. We did not implement the whole T5 model, only a part of it. But the part that was implemented could be the whole model for some applications.

Correct (just a small detail: in the BERT case, these are “masked tokens” instead of “sentinels”). When predicting the masked token, the model “can see” all the words before and after it, and is thus bi-directional.
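The difference shows up in the attention mask. Here is a minimal NumPy sketch, with a hypothetical helper name and all-zero scores so that the effect of the mask is easy to see: an encoder-style (bi-directional) layer applies no causal mask, while a decoder-style layer blocks attention to later positions.

```python
import numpy as np

def attention_weights(scores, causal):
    """Softmax over attention scores, optionally with a causal mask."""
    seq_len = scores.shape[-1]
    if causal:
        # Position i may only attend to positions 0..i.
        allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        scores = np.where(allowed, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # equal scores, so only the mask shapes the result
bidirectional = attention_weights(scores, causal=False)
causal = attention_weights(scores, causal=True)
print(bidirectional[1])  # weight spread over all 4 positions
print(causal[1])         # weight only on positions 0 and 1
```

In the bi-directional case, the row for a masked token puts weight on tokens both before and after it, which is what lets the encoder use context from both sides.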
