Hi everyone,
In the Attention Model video, Andrew says
If you have Tx words in the input and Ty words in the output then the total number of these attention parameters are going to be Tx * Ty.
Why is that? I thought the whole idea of attention was that we only look at a few nearby words instead of the entire sentence.
In this model the attention spans the entire input sequence: for each of the Ty output words, the network computes one attention weight for every one of the Tx input words, which is where the Tx * Ty count comes from. This removes the need for an extra hyperparameter (an attention window length) and lets the model generalize better.
From a language perspective, it's impossible to know in advance where the references that need resolving will be. For instance, the word "he" might be right next to a name like "Adam" or might be very far away from it.
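Here is a minimal NumPy sketch of that counting argument (the sizes, variable names, and the toy scoring function are my own assumptions, not the course's exact code): at every output step t we compute a softmax over all Tx input positions, so the full set of attention weights alpha<t, t'> forms a Ty x Tx matrix, i.e. Tx * Ty values in total.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy sizes (assumptions, just to show the shapes)
Tx, Ty = 30, 10          # input / output sequence lengths
n_a, n_s = 32, 64        # encoder / decoder hidden sizes

a = np.random.randn(Tx, 2 * n_a)        # encoder activations a<t'>, one per input word
s_prev = np.zeros(n_s)                  # previous decoder state s<t-1>
W = np.random.randn(2 * n_a + n_s)      # stand-in for the small "attention" dense layer

alphas = np.zeros((Ty, Tx))             # alpha<t, t'>: one weight per (output, input) pair
for t in range(Ty):
    # energy e<t, t'> computed from [s<t-1>, a<t'>] for every input position t'
    energies = np.concatenate([np.tile(s_prev, (Tx, 1)), a], axis=1) @ W
    alphas[t] = softmax(energies)       # normalize over the Tx input positions
    context = alphas[t] @ a             # context vector fed to the decoder at step t
    # ... a real model would update s_prev from `context` here ...

print(alphas.shape)                     # (Ty, Tx): Tx * Ty attention weights in total
```

Note these are attention *values* recomputed for every sentence, not extra trained parameters; only the small dense layer that produces the energies is learned.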