Hi everyone,
In the Attention Model video, Andrew says
If you have Tx words in the input and Ty words in the output then the total number of these attention parameters are going to be Tx * Ty.
Why is that? I thought the whole idea of attention was that we only look at a few nearby words instead of the entire sentence.
In this model the attention spans the entire input sequence: for each of the Ty output words, the network computes one attention weight for every one of the Tx input words, which is where the Tx * Ty count comes from. This removes the need for an extra hyperparameter (an attention window length) and lets the model generalize better.
From a language perspective, it's impossible to know in advance where the references that need resolving will be. For instance, the word "he" might be right next to a name like "Adam" or might be very far away from it.
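Here is a minimal NumPy sketch of that counting argument (the sizes, variable names, and the toy scoring function are my own assumptions, not the course's exact code): at every output step t we compute a softmax over all Tx input positions, so the full set of attention weights alpha<t, t'> forms a Ty x Tx matrix, i.e. Tx * Ty values in total.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy sizes (assumptions, just to show the shapes)
Tx, Ty = 30, 10          # input / output sequence lengths
n_a, n_s = 32, 64        # encoder / decoder hidden sizes

a = np.random.randn(Tx, 2 * n_a)        # encoder activations a<t'>, one per input word
s_prev = np.zeros(n_s)                  # previous decoder state s<t-1>
W = np.random.randn(2 * n_a + n_s)      # stand-in for the small "attention" dense layer

alphas = np.zeros((Ty, Tx))             # alpha<t, t'>: one weight per (output, input) pair
for t in range(Ty):
    # energy e<t, t'> computed from [s<t-1>, a<t'>] for every input position t'
    energies = np.concatenate([np.tile(s_prev, (Tx, 1)), a], axis=1) @ W
    alphas[t] = softmax(energies)       # normalize over the Tx input positions
    context = alphas[t] @ a             # context vector fed to the decoder at step t
    # ... a real model would update s_prev from `context` here ...

print(alphas.shape)                     # (Ty, Tx): Tx * Ty attention weights in total
```

Note these are attention *values* recomputed for every sentence, not extra trained parameters; only the small dense layer that produces the energies is learned.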