Intuition behind Transformer = Attention + CNN

Hello,

In what way is the Transformer like “Attention + CNN”? For example, in the Self-Attention video, can we consider q and k as filters and A as the feature learned by the filter? In other words, in a CNN layer with Nc filters, each filter will learn a different feature of the input image.

In Andrew’s video, the A<3> feature is intended to answer the question “What’s happening in Africa?”, but there is no guarantee that the q<3> and k<3> filters will end up computing that exact feature, correct? For a given word “i” in the sentence, q represents some learned query about word “i” that determines its word embedding A, and figures out another word “j” whose k has the greatest impact on the learned q and A. Is my understanding correct?

In the self-attention step, does the Transformer algorithm compute q, k, and A for each word “i” in the sentence in parallel, or is this done sequentially?

Assuming my understanding so far is correct, does multi-head attention compute a new question and word-embedding feature for the same word, i.e. does each head of the “multi-heads” represent a different “Conv” layer? Within each conv layer or self-attention head, is each q, A, k tuple equivalent to one of the Nc filters in a given convolution layer? In that case, is Nc = the number of words in the input sentence?

If I misunderstood the intuition behind the self-attention and multi-head steps, please clarify.

–Rahul

Hey @rvh,

Your intuition is correct. Transformer networks are similar to CNNs in the sense that they are able to process their inputs in parallel.

  • The query initiates the look-up. It identifies the current focus of attention – the word that we match against the other keys.
  • The key serves as an identifier for a word; the query is matched against the keys.
  • The value is a contextual representation of the word (a small sketch of how q, k, and v are derived from a word’s embedding follows this list).
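To make those roles concrete, here is a minimal NumPy sketch (not the course implementation; the sizes and weights are made up) of how a single word’s embedding is projected into its query, key, and value by learned weight matrices:

```python
import numpy as np

# Minimal sketch with hypothetical sizes; W_q, W_k, W_v are learned in training,
# random here just for illustration.
d_model, d_k = 8, 4                      # embedding size and projection size
rng = np.random.default_rng(0)

W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

x_i = rng.normal(size=(d_model,))        # embedding of word i

q_i = x_i @ W_q   # query: what is word i looking for?
k_i = x_i @ W_k   # key:   how does word i present itself to other queries?
v_i = x_i @ W_v   # value: what content does word i contribute if attended to?
```

The same three matrices are applied to every word in the sentence, so the whole sentence can be projected in one matrix multiplication.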

Matching a query to the keys means that for each query/key pair we calculate an attention weight. The weights represent the probability of each key being a match for the query: a higher attention weight means that the value behind that key is more relevant to the query.

We multiply the attention weights by the values to get an attention-based vector representation for each word.

It’s possible to compute attention for all the words at once, without any loop.

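As an illustration (a hedged NumPy sketch with made-up shapes, not the course code), this is scaled dot-product attention applied to all words simultaneously, where Q, K, and V stack the per-word queries, keys, and values as rows:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed for all words at once."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_words, n_words) compatibility scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax = attention weights
    return weights @ V                                # row i is the attention-based vector for word i

# Usage with hypothetical shapes: a 5-word sentence, d_k = 4
rng = np.random.default_rng(1)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
A = scaled_dot_product_attention(Q, K, V)             # shape (5, 4), one row per word
```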

Each head of multi-head self-attention has its own set of parameters and can be computed in parallel. Given these distinct sets of parameters, each head can learn a different aspect of the relationships that exist among the inputs.
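As a rough sketch (again NumPy with made-up sizes, not the course code), multi-head attention runs the same attention computation once per head, using that head’s own projection matrices, and then concatenates the results:

```python
import numpy as np

def attend(Q, K, V):
    """Same scaled dot-product attention as in the previous sketch."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, heads, W_o):
    # Each head has its own (W_q, W_k, W_v). The heads are independent of each
    # other, so a framework can run them in parallel; the list comprehension
    # below is sequential only for readability.
    outputs = [attend(X @ W_q, X @ W_k, X @ W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1) @ W_o     # concatenate heads, then project back

# Hypothetical sizes: 5 words, d_model = 8, two heads with d_k = 4 each
rng = np.random.default_rng(2)
d_model, d_k, n_heads = 8, 4, 2
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d_model))
X = rng.normal(size=(5, d_model))                     # one embedding per word
out = multi_head_attention(X, heads, W_o)             # shape (5, d_model)
```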
