My High-level Understanding of Attention:
Do I understand correctly that the attention mechanism essentially decouples each input ‘token’ from the others? That is, it works by encoding any relevant non-local context (and positional information) into each token’s representation/embedding, where each head appears to be responsible for encoding just one ‘feature’ of a token’s context.
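To make my mental model concrete, here is a minimal NumPy sketch of a single attention head (scaled dot-product attention, as in the original Transformer paper; the shapes and weight names are my own toy choices, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    # One head: each token forms a query, then mixes in context from
    # every token in the sequence, weighted by query-key similarity.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq, seq) similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # context-enriched token vectors

rng = np.random.default_rng(0)
seq, d_model, d_head = 4, 8, 2               # toy sizes
X = rng.normal(size=(seq, d_model))          # toy token embeddings
out = attention_head(X,
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head)))
print(out.shape)  # (4, 2): one small context vector per token
```

Each head has its own Wq/Wk/Wv, so each head can attend to (i.e. encode) a different aspect of the context, which is what I mean by one ‘feature’ per head.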
Relationship to CNNs:
And if so… is an individual attention head analogous to a filter in a CNN? In the sense that each filter searches for a different ‘feature’ in the previous level of encoding (e.g. one filter to detect circles, just as one head might find the subjects of adjectives)?
Or, since the dense layers that follow the attention heads are applied position-wise (i.e. time-distributed, equivalent to a 1D convolution with kernel size 1), is it more appropriate to think of an attention head plus its following position-wise dense layer as the ‘filter’?
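For what it’s worth, here is the equivalence I have in mind: applying the same dense weights at every position is exactly a kernel-size-1 convolution. A small NumPy check (function names and sizes are my own, just for illustration):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    # Applied independently at every position ("time-distributed"):
    # the same weights are reused across the whole sequence.
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2

def conv1d_k1(X, W1, b1, W2, b2):
    # The same computation written as an explicit loop over positions,
    # i.e. a 1D convolution with kernel size 1.
    return np.stack([np.maximum(x @ W1 + b1, 0) @ W2 + b2 for x in X])

rng = np.random.default_rng(1)
seq, d_model, d_ff = 4, 8, 16                 # toy sizes
X = rng.normal(size=(seq, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)
assert np.allclose(position_wise_ffn(X, W1, b1, W2, b2),
                   conv1d_k1(X, W1, b1, W2, b2))
```

So the position-wise dense layer transforms each token’s (now context-enriched) vector independently, which is why I wonder whether head + dense layer together is the better analogue of a CNN filter.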