The video on text summarization (“Transformer Summarizer”) mentions that the input is the full text followed by an end of sentence token, concatenated with the summary (followed by another end of sentence token, and padded if necessary). If we view summarization as a supervised task, use only the encoder part of the transformer as the model (e.g. as in BERT), and feed that model the input described above, then what is the output to train the model on? It is more “natural” or “convenient” for me to think of an input as the article text (only) and the output the summary (only). Here, on the other hand there is no mention of what the output is. From the same video when explaining the cross entropy loss one can infer that the output is the same as the input by “ignoring the words of the article to be summarized”. This too is not easy to comprehend due to the overfitting that would take place.
I would appreciate it if someone could provide further intuition on this, or guidance if I am (completely) missing something. Thank you in advance.
Hello @Davit_Khachatryan!
It is nice to hear from you. That’s a good question. The thing is, these models are pre-trained: they were trained on a large amount of general data, which gives the model general features. When we use them, we need them to be more specific to our task. We get that by fine-tuning them on data that looks like ours, then saving the fine-tuned model and testing it on our data.
Here is more about summarization from Hugging Face: hugging-face summarization
I hope this answers your question.
Have a nice day!!
It is more “natural” or “convenient” for me to think of an input as the article text (only) and the output the summary (only)
Yes, conceptually you can think of it in that way. But the implementation uses some tricks to get to that.
Here, on the other hand there is no mention of what the output is.
The output is the next token (word).
From the same video when explaining the cross entropy loss one can infer that the output is the same as the input by “ignoring the words of the article to be summarized”. This too is not easy to comprehend due to the overfitting that would take place.
Maybe it would make more sense if I give you a simplified example. Here a_n is word n of an article with 79 words, s_n is word n of a summary with 12 words, and pad_n is padding:
input:
[a_1, a_2, a_3, ... ,a_{79}, <eos>, <sep>, s_1, s_2, s_3, ...,s_{12}, <eos>, pad_1 .. pad_8]
(not important, but additional note: in the Assignment <pad> is used instead of <sep>)
weights for the loss:
[0, 0, 0 .. 0, 0, 0, 1, 1, 1 .. 1, 1, 0 .. 0]
(note: all article words + <eos>, <sep> and padding are not penalizing the model if it predicts any of them wrong)
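As a rough illustration of the layout above, here is a minimal sketch in plain Python. The function name `build_example` and the literal token strings are made up for this example (they are not from the course code), and the total length of 102 is just what the 79-word article, 12-word summary, special tokens and 8 pads add up to.

```python
def build_example(article, summary, max_len, eos="<eos>", sep="<sep>", pad="<pad>"):
    """Concatenate article and summary, pad to max_len, and build loss weights.

    Hypothetical helper for illustration only.
    """
    tokens = article + [eos, sep] + summary + [eos]
    n_pad = max_len - len(tokens)
    if n_pad < 0:
        raise ValueError("max_len is too small for this example")
    # 0 for article words, first <eos> and <sep>; 1 for summary words and
    # the final <eos>; 0 again for padding.
    weights = [0] * (len(article) + 2) + [1] * (len(summary) + 1) + [0] * n_pad
    return tokens + [pad] * n_pad, weights


article = [f"a_{i}" for i in range(1, 80)]   # a_1 .. a_79
summary = [f"s_{i}" for i in range(1, 13)]   # s_1 .. s_12
tokens, weights = build_example(article, summary, max_len=102)
print(len(tokens), sum(weights))
```

Only the 13 positions holding the summary words and the final `<eos>` carry a weight of 1.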
You can imagine the process in training as:
- First, the model predicts the first word (a_1);
- Then, it gets the correct word as input (not the one it predicted, aka teacher forcing) and tries to predict the second one (a_2);
- Then, it gets the first two as input ([a_1, a_2]) and tries to predict the third one (a_3);
- This repeats all the way to the end (including first <eos>, <sep>, summary words, <eos> and padding).
- Then, when calculating the loss, you compare the actual words with the model’s assigned probabilities for that word.
[a_1 == p_1, a_2 == p_2, … , a_{79} == p_{79}, <eos> == p_{80}, <sep> == p_{81}, s_1 == p_{82}, s_2 == p_{83}, … , pad_8 == p_{102}]
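The teacher forcing steps above can be sketched like this. It is only an illustration: `teacher_forcing_pairs` is a made-up name, and real implementations usually do this in one pass by shifting the sequence right and masking, rather than building explicit prefixes.

```python
def teacher_forcing_pairs(tokens):
    """Pair each target token with the ground-truth prefix that precedes it.

    The context always holds the correct tokens, never the model's own
    predictions. Hypothetical helper for illustration only.
    """
    return [(tokens[:t], tokens[t]) for t in range(len(tokens))]


tokens = ["a_1", "a_2", "a_3", "<eos>", "<sep>", "s_1", "s_2", "<eos>"]
pairs = teacher_forcing_pairs(tokens)
for context, target in pairs[:3]:
    print(context, "->", target)
```

By the time the model has to predict s_2, its context already contains the whole article, the separator and the correct s_1.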
The important part here is that if the model’s probability p_1 is low for the actual word a_1, the resulting loss term is multiplied by 0. The same goes for the second, third and all the words up to the summary words: these losses are multiplied by 0. (This is the trick for “ignoring the words of the article to be summarized”.)
(* note: sometimes, the weights for the loss of the article words are not 0, but some small amount like 0.01)
On the other hand, the model’s probability p_{82} for s_1 should be high, because this loss is multiplied by 1. The same goes for the other words of the summary (s_2 .. s_{12} and <eos>). Then you sum up the losses and reduce/increase the model weights accordingly.
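To make the effect of the weights concrete, here is a small sketch of a weighted cross entropy in plain Python. The function name and the probabilities are invented for the example. It returns a plain weighted sum, as described above; dividing by the sum of the weights is a common variant.

```python
import math


def weighted_cross_entropy(target_probs, weights):
    """Sum of -log(p) over positions, each term scaled by its loss weight.

    target_probs[i] is the probability the model assigned to the correct
    token at position i. Hypothetical helper for illustration only.
    """
    return sum(w * -math.log(p) for p, w in zip(target_probs, weights))


weights = [0, 0, 0, 1, 1]                 # 3 article positions, 2 summary positions
bad_on_article = [0.01, 0.02, 0.01, 0.9, 0.8]
good_on_article = [0.99, 0.98, 0.99, 0.9, 0.8]

loss_bad = weighted_cross_entropy(bad_on_article, weights)
loss_good = weighted_cross_entropy(good_on_article, weights)
print(math.isclose(loss_bad, loss_good))
```

Both losses are the same, because only the last two positions (the summary) contribute. That is why the model is not pushed to reproduce the article.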
When doing inference (the actual predictions after training), you provide the input (the article words with <eos> and <sep> at the end) and the model tries to predict the next word. Now comes the difference from training - for the second word of the summary prediction, you provide the input plus the first word that the model itself predicted (there is no correct summary to feed in), and so on for every following word.
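A minimal sketch of that inference loop, assuming greedy decoding (always taking the single predicted next word). Both `greedy_decode` and `toy_next_token` are made up for the example; the toy function just stands in for a trained model.

```python
def greedy_decode(next_token, article, max_new_tokens=20, eos="<eos>", sep="<sep>"):
    """Generate a summary one token at a time, feeding predictions back in.

    next_token is any function mapping a token list to the predicted next
    token. Hypothetical helper for illustration only.
    """
    context = article + [eos, sep]
    summary = []
    for _ in range(max_new_tokens):
        token = next_token(context + summary)  # the model sees its own output
        if token == eos:
            break
        summary.append(token)
    return summary


def toy_next_token(tokens):
    """Stand-in for a trained model: emits a canned summary after <sep>."""
    canned = ["s_1", "s_2", "s_3", "<eos>"]
    n_generated = len(tokens) - 1 - tokens.index("<sep>")
    return canned[min(n_generated, len(canned) - 1)]


result = greedy_decode(toy_next_token, ["a_1", "a_2", "a_3"])
print(result)
```

The loop stops as soon as the model emits `<eos>`, or after `max_new_tokens` steps.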
Hope it helps. Cheers
@arvyzukai thanks for taking the time to respond. Unfortunately, your answer still does not address the reason or intuition as to why the model (during training) has to see the summary as part of the input. Is there any published evidence that this way the validation error is less than if this was done without letting the model see the summaries as part of the inputs? The example at the end of your response is detailed and I appreciate your time, but it merely exemplifies what appears at the end of the video in question without answering my main question.
Because the summary would not make any sense if you do not have the start of it. Remember - you predict one word at a time. If you do not have the start of the summary, how, for example, would you predict word number 23 of the summary (if you do not know the previous 22)?
Thanks a lot for the clarification. It makes sense, I appreciate it.