“you will use a language model – Transformer Decoder – to solve an input-output problem. As you know, language models only predict the next word, they have no notion of inputs. To create a single input suitable for a language model, we concatenate inputs with targets putting a separator in between”
It is not clear how the summary is generated from the same input again and again. Why is the summary tagged along with the input? What is the intuition? I am unable to understand this. Thanks
Hi @Bharath_Mukundakrish
That is a good question. This week’s task (summarization) is accomplished with a decoder-only architecture, meaning that only the decoder is used (not the usual encoder-decoder setup, where the encoder would encode the input and the decoder would generate the summary).
To achieve summarization with a decoder-only architecture, a special token (<SEP>) is used to separate the input from the summary and to indicate to the model where the summary begins.
Let’s take a concrete trivial example:
“Very very long sentence with a lot of words <SEP> The summary <EOS>”.
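To make the concatenation concrete, here is a minimal sketch (not the assignment’s exact code) of how such a training example could be built; the function name and the plain-string tokens are just illustrative:

```python
# Minimal sketch (not the assignment's exact code) of how a single training
# example is built: the article and the summary are joined with special
# tokens so the decoder sees one continuous sequence.
SEP, EOS = "<SEP>", "<EOS>"

def build_example(article: str, summary: str) -> str:
    """Concatenate article and summary into one decoder input string."""
    return f"{article} {SEP} {summary} {EOS}"

print(build_example("Very very long sentence with a lot of words", "The summary"))
# -> Very very long sentence with a lot of words <SEP> The summary <EOS>
```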
Given this example, the model needs to learn to predict the next token from the preceding tokens. For example, the model should learn to behave something like this (input → output); a short code sketch of these prefix → next-token pairs follows after the list:
- <BOS> → “The” (wrong prediction/penalized)
- <BOS> Very → “very” (correct prediction/rewarded)
- …
- <BOS> Very very long → “time” (wrong prediction/penalized)
- <BOS> Very very long sentence → “with” (correct prediction/rewarded)
- … etc.
- <BOS> Very very long sentence with a lot of words <SEP> → “The” (correct prediction/rewarded)
- <BOS> Very very long sentence with a lot of words <SEP> The → “sentence” (wrong prediction/penalized - should have been the word “summary”)
- <BOS> Very very long sentence with a lot of words <SEP> The summary → “<EOS>” (correct prediction/rewarded)
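In code, this is just next-token prediction with teacher forcing: every prefix of the concatenated sequence is an input and the token that follows it is the target. A rough sketch (token strings instead of integer IDs, purely illustrative):

```python
# Rough sketch of the (prefix -> next token) pairs the model is trained on.
# Real implementations work on integer token IDs and score all positions in
# parallel under a causal mask; this explicit loop just makes the idea visible.
tokens = ["<BOS>", "Very", "very", "long", "sentence", "with", "a", "lot",
          "of", "words", "<SEP>", "The", "summary", "<EOS>"]

for i in range(1, len(tokens)):
    prefix, target = tokens[:i], tokens[i]
    print(" ".join(prefix), "->", target)
# <BOS> -> Very
# <BOS> Very -> very
# ...
# <BOS> Very very long sentence with a lot of words <SEP> The summary -> <EOS>
```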
So the intuition could be that, looping over a gazillion examples of different sentences, the model learns to predict the correct tokens (it “learns” the language and also learns to generate the summary after the special token).
Cheers
P.S. I should have also mentioned that in this week’s Assignment, the mask used is 0 for all the words before the <SEP> token and 1 after it, so the model is penalized/rewarded only on the summary part. This is not a must: a soft mask (for example, 0.01 for the tokens before the <SEP> token and 1s for the summary part) could be used so that the model converges faster and more efficiently and, maybe, gives even better results.
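For concreteness, here is a rough NumPy sketch of such a loss-weight mask (0 up to and including <SEP>, 1 for the summary tokens), with the soft mask as an optional argument; the helper name and token IDs are hypothetical, not the assignment’s code:

```python
import numpy as np

def loss_weights(token_ids, sep_id, soft=0.0):
    """Weight 0 (or a small soft value) up to and including <SEP>, 1 after it."""
    token_ids = np.asarray(token_ids)
    sep_pos = int(np.argmax(token_ids == sep_id))  # position of the first <SEP>
    weights = np.full(token_ids.shape, soft, dtype=np.float32)
    weights[sep_pos + 1:] = 1.0  # summary tokens contribute fully to the loss
    return weights

ids = [7, 7, 7, 7, 99, 3, 4, 2]                 # 99 stands in for <SEP>
print(loss_weights(ids, sep_id=99))              # [0. 0. 0. 0. 0. 1. 1. 1.]
print(loss_weights(ids, sep_id=99, soft=0.01))   # soft-mask variant
```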
Is there a relation between eigenvectors/eigenvalues and the notion of summarization of a given document (or documents)? If so, how does the relationship show up in summarization models?
Also, many of the concepts of attention et al. seem to me to have analogies to splines/basis functions, blossoming, and convolution with basis splines in the CAD (Computer Aided Design) world and the differential calculus world.
Thanks