Attention understanding & relationship to CNNs?

My High-level Understanding of Attention:
Do I understand correctly that the attention mechanism essentially decouples the input ‘tokens’ from one another? That is, it works by encoding any relevant non-local context (and positional information) into each token’s representation/embedding, where each head appears to be responsible for encoding only one ‘feature’ of each token’s context.
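To make that concrete, here’s roughly how I picture a single head working. This is just a minimal sketch of scaled dot-product attention; the names (`attention_head`, `W_q`, `W_k`, `W_v`) and the shapes are illustrative, not taken from any particular library:

```python
import torch
import torch.nn.functional as F

def attention_head(x, W_q, W_k, W_v):
    # x: (seq_len, d_model) -- one embedding per token
    Q, K, V = x @ W_q, x @ W_k, x @ W_v       # per-head projections
    scores = Q @ K.T / K.shape[-1] ** 0.5     # how strongly each token attends to every other
    weights = F.softmax(scores, dim=-1)       # each row sums to 1
    return weights @ V                        # each output row mixes context into that token

seq_len, d_model, d_head = 5, 16, 8
x = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_head) for _ in range(3))
out = attention_head(x, W_q, W_k, W_v)        # (seq_len, d_head): context baked into each token
```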

Relationship to CNNs:
And if so… is an individual attention head analogous to a filter in a CNN, insofar as each filter searches for a different ‘feature’ in the previous ‘level of encoding’ (e.g. one filter to detect circles, just as one head might find the subjects of adjectives)?
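If it helps, the parallel I have in mind looks something like this (reusing the illustrative `attention_head()` from the sketch above): a conv layer stacks the outputs of independent filters into channels, and multi-head attention concatenates the outputs of independent heads, each with its own parameters, just like each filter has its own kernel.

```python
def multi_head(x, heads):
    # heads: list of (W_q, W_k, W_v) tuples, one per head -- like one kernel per filter
    return torch.cat([attention_head(x, *h) for h in heads], dim=-1)

d_model, d_head = 16, 8
x = torch.randn(5, d_model)
heads = [tuple(torch.randn(d_model, d_head) for _ in range(3)) for _ in range(4)]
out = multi_head(x, heads)   # (seq_len, 4 * d_head): one "channel group" per head
```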

Or, since the dense layers used after the attention heads are time-distributed (i.e. the same weights are applied at every position, which is equivalent to a 1×1 convolution), is it more appropriate to think of an attention head plus the following position-wise dense layer as a ‘filter’?
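For reference, here’s a quick check of the time-distributed-dense / 1×1-convolution equivalence I mean (a PyTorch sketch; the tensor shapes are just illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 5, 16)                            # (batch, seq_len, d_model)
linear = nn.Linear(16, 32)
conv = nn.Conv1d(16, 32, kernel_size=1)
conv.weight.data = linear.weight.data.unsqueeze(-1)  # (32, 16) -> (32, 16, 1)
conv.bias.data = linear.bias.data

y_linear = linear(x)                                 # same weights applied at every position
y_conv = conv(x.transpose(1, 2)).transpose(1, 2)     # Conv1d expects (batch, channels, seq)
print(torch.allclose(y_linear, y_conv, atol=1e-6))   # True: the two are the same operation
```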

Hi dwyerfire,

Here’s my two cents.

I see attention as a feature extractor that operates on the input text as a whole, extracting relevant meaning features from the value vectors. This is conceptually comparable to the filters in a CNN, although the output of attention also depends on the input vectors themselves, not solely on the learned parameters that produce the query, key, and value matrices. So in a sense, the input itself functions, to some extent, as a meaning-feature selection filter.
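To illustrate that last point: a trained conv filter applies the same fixed weights to every input, whereas attention recomputes its mixing weights from the input itself. A minimal sketch (the identity projections are just for illustration):

```python
import torch
import torch.nn.functional as F

x1, x2 = torch.randn(5, 8), torch.randn(5, 8)
W_q = W_k = torch.eye(8)                       # trivial projections, for illustration only

def mixing_weights(x):
    scores = (x @ W_q) @ (x @ W_k).T / 8 ** 0.5
    return F.softmax(scores, dim=-1)

# Identical parameters, different inputs -> different mixing weights.
print(torch.allclose(mixing_weights(x1), mixing_weights(x2)))  # almost certainly False
```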

The way I see it, the feed-forward layers can be compared to the fully connected layers in a CNN.