Intuition behind Transformer = Attention + CNN


In what way is the Transformer like “Attention + CNN”? For example, in the Self-Attention video, can we consider q and k as filters and A as the feature learned by the filter? In other words, in a CNN layer with Nc filters, each filter will learn a different feature of the input image.

In Andrew’s video, the A<3> feature is intended to answer the question “What’s happening in Africa”, but there is no guarantee that the q<3> and k<3> filters will end up computing that exact feature, correct? For a given word “i” in the sentence, q represents some learned query about word “i” that determines its word embedding A, and figures out which other word “j” has the k with the greatest impact on the learned q and A. Is my understanding correct?

In the self-attention step, does the Transformer algorithm compute q, k, A for each word “i” in the sentence in parallel or is this done sequentially?

Assuming my understanding so far is correct, does multi-head compute a new question and word embedding feature for the same word? i.e. each head of the “multi-heads” represents a different “Conv” layer? Within each conv layer or self-attention head, each q, A, k tuple is equivalent to one of the Nc filters in a given convolution layer? In this case, is Nc = number of words in the input sentence?

If I misunderstood the intuition behind the self-attention and multi-head steps, please clarify.


Hey @rvh,

Your intuition is correct. Transformer networks are similar to CNNs in the sense that they are able to process inputs in parallel.

  • The query initiates the look-up. It represents the word currently in the focus of attention, which we match against the keys of the other words.
  • The key is an identifier for a word; the query is matched against the keys.
  • The value is a contextual representation of the word.

Matching a query to keys means that for each key/query pair we calculate an attention weight. After a softmax, the weights for a given query sum to 1, so they can be read as the probability of each key being a match for that query. A higher attention weight means that the value behind a key is more relevant to the query.

We multiply the attention weights by the values and sum the results to get an attention-based vector representation for each word.

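It’s possible to compute attention without any loop: the queries, keys, and values for all words are stacked into matrices, so q, k, and A for every word “i” are computed in parallel rather than sequentially.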
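To make the loop-free computation concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. It is only an illustration, not the course's implementation: the function names, the sizes (`n_words`, `d_model`, `d_k`), and the random weight matrices are assumptions made up for the example; in a real model `W_q`, `W_k`, and `W_v` are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for all words at once.

    X: (n_words, d_model) word embeddings, one row per word.
    W_q, W_k, W_v: projection matrices (learned in a real model).
    """
    Q = X @ W_q                       # queries, (n_words, d_k)
    K = X @ W_k                       # keys,    (n_words, d_k)
    V = X @ W_v                       # values,  (n_words, d_k)
    d_k = Q.shape[-1]
    # Entry (i, j) scores how well query i matches key j.
    scores = Q @ K.T / np.sqrt(d_k)   # (n_words, n_words)
    # Softmax over keys: each row becomes weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # Weighted sum of values gives the representation A for each word.
    A = weights @ V                   # (n_words, d_k)
    return A, weights

# Hypothetical sizes and random (untrained) parameters, for illustration only.
rng = np.random.default_rng(0)
n_words, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n_words, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

A, weights = self_attention(X, W_q, W_k, W_v)
print(A.shape)        # (5, 4): one attention vector per word
print(weights.shape)  # (5, 5): one weight per query/key pair
```

Note that there is no loop over words: row “i” of `weights` holds the attention weights for word “i”, and all rows come out of the same matrix product.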


Each head of the multihead self-attention has its own set of parameters and can be computed in parallel. Given these distinct sets of parameters, each head can learn different aspects of the relationships that exist among inputs.
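As a rough sketch of the point about separate parameters per head, the NumPy example below gives every head its own query, key, and value projections and evaluates all heads in one batched operation. The names (`multi_head_self_attention`, `W_o`), the sizes, and the random weights are assumptions for illustration; the final output projection `W_o` follows the usual Transformer formulation and is not something described in this thread.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """Multi-head self-attention with all heads computed in parallel.

    X: (n_words, d_model)
    W_q, W_k, W_v: (n_heads, d_model, d_head), one parameter set per head.
    W_o: (n_heads * d_head, d_model), mixes the concatenated heads.
    """
    # Project the same input with each head's own parameters.
    Q = np.einsum("nd,hdk->hnk", X, W_q)   # (n_heads, n_words, d_head)
    K = np.einsum("nd,hdk->hnk", X, W_k)
    V = np.einsum("nd,hdk->hnk", X, W_v)
    d_head = Q.shape[-1]
    # Batched matmul: every head gets its own (n_words, n_words) score matrix.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                    # (n_heads, n_words, d_head)
    # Concatenate the heads for each word, then apply the output projection.
    n_heads, n_words, _ = heads.shape
    concat = heads.transpose(1, 0, 2).reshape(n_words, n_heads * d_head)
    return concat @ W_o, weights

# Hypothetical sizes and random (untrained) parameters, for illustration only.
rng = np.random.default_rng(0)
n_words, d_model, n_heads, d_head = 5, 8, 3, 4
X = rng.normal(size=(n_words, d_model))
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))
W_o = rng.normal(size=(n_heads * d_head, d_model))

out, head_weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)           # (5, 8): one d_model-sized vector per word
print(head_weights.shape)  # (3, 5, 5): a separate attention map per head
```

Because each head has different parameters, the attention maps in `head_weights` generally differ from head to head, which is what lets the heads pick up different relationships among the same words. The number of heads is a hyperparameter and is unrelated to the number of words in the sentence.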
