My High-level Understanding of Attention:
Do I understand correctly that the attention mechanism essentially decouples each input ‘token’ from the others? That is, it works by encoding any relevant non-local context (and positional information) into each token’s representation/embedding, where each head appears to be responsible for encoding just one ‘feature’ of a token’s context.
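To make my mental model concrete, here is a minimal NumPy sketch of a single attention head (scaled dot-product attention, as in the original Transformer paper; the shapes and weight names are my own toy choices, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    # One head: each token forms a query, then mixes in context from
    # every token in the sequence, weighted by query-key similarity.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq, seq) similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # context-enriched token vectors

rng = np.random.default_rng(0)
seq, d_model, d_head = 4, 8, 2               # toy sizes
X = rng.normal(size=(seq, d_model))          # toy token embeddings
out = attention_head(X,
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head)))
print(out.shape)  # (4, 2): one small context vector per token
```

Each head has its own Wq/Wk/Wv, so each head can attend to (i.e. encode) a different aspect of the context, which is what I mean by one ‘feature’ per head.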
Relationship to CNNs:
And if so… is an individual attention head analogous to a filter in a CNN? In the sense that each filter searches for a different ‘feature’ in the previous level of encoding (e.g. one filter to detect circles, just as one head might find the subjects of adjectives)?
Or, since the dense layers that follow the attention heads are applied position-wise (i.e. time-distributed, equivalent to a 1D convolution with kernel size 1), is it more appropriate to think of an attention head plus its following position-wise dense layer as the ‘filter’?
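For what it’s worth, here is the equivalence I have in mind: applying the same dense weights at every position is exactly a kernel-size-1 convolution. A small NumPy check (function names and sizes are my own, just for illustration):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    # Applied independently at every position ("time-distributed"):
    # the same weights are reused across the whole sequence.
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2

def conv1d_k1(X, W1, b1, W2, b2):
    # The same computation written as an explicit loop over positions,
    # i.e. a 1D convolution with kernel size 1.
    return np.stack([np.maximum(x @ W1 + b1, 0) @ W2 + b2 for x in X])

rng = np.random.default_rng(1)
seq, d_model, d_ff = 4, 8, 16                 # toy sizes
X = rng.normal(size=(seq, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)
assert np.allclose(position_wise_ffn(X, W1, b1, W2, b2),
                   conv1d_k1(X, W1, b1, W2, b2))
```

So the position-wise dense layer transforms each token’s (now context-enriched) vector independently, which is why I wonder whether head + dense layer together is the better analogue of a CNN filter.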