Hello,
In what way is the Transformer like “Attention + CNN”? For example, in the Self-Attention video, can we consider q and k as filters and A as the feature learned by the filter? In other words, in a CNN layer with Nc filters, each filter will learn a different feature of the input image.
In Andrew’s video, the A<3> feature is intended to answer the question “What’s happening in Africa?”, but there is no guarantee that the q<3> and k<3> filters will end up computing that exact feature, correct? For a given word “i” in the sentence, q represents some learned query about word “i” that determines its word embedding A, and it identifies another word “j” whose k has the greatest impact on the learned q and A. Is my understanding correct?
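To make my mental model concrete, here is how I currently picture the computation for one word. This is just a rough NumPy sketch of my understanding, not something from the video; the weight matrices W_q, W_k, W_v and the dimension names are my own assumptions:

```python
import numpy as np

def attention_for_word(i, X, W_q, W_k, W_v):
    """My mental model of computing the attention-based feature A<i> for word i.

    X: (n_words, d_model) matrix of input word embeddings.
    W_q, W_k, W_v: learned projection matrices (my assumption).
    """
    q_i = X[i] @ W_q                                 # the learned "query" about word i
    K = X @ W_k                                      # one key k<j> per word j
    V = X @ W_v                                      # one value v<j> per word j
    scores = K @ q_i / np.sqrt(W_k.shape[1])         # how well each k<j> matches q<i>
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over all words j
    return weights @ V                               # A<i>: attention-weighted sum of values
```

If this sketch is right, then my question is whether W_q and W_k here play a role analogous to a conv filter’s weights.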
In the self-attention step, does the Transformer compute q, k, and A for each word “i” in the sentence in parallel, or is this done sequentially?
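In other words, continuing the sketch above, I imagine all words could be handled at once in matrix form, something like this (again my own assumption of what “in parallel” would mean):

```python
def attention_all_words(X, W_q, W_k, W_v):
    """The same computation for every word at once: Q, K, V each come out of
    a single matrix multiply, which is what I picture "in parallel" to mean."""
    Q = X @ W_q                                      # every q<i> in one shot
    K = X @ W_k                                      # every k<j>
    V = X @ W_v                                      # every v<j>
    scores = Q @ K.T / np.sqrt(W_k.shape[1])         # (n_words, n_words) match scores
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                               # row i is A<i>
```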
Assuming my understanding so far is correct, does multi-head attention compute a new question and word-embedding feature for the same word? That is, does each head of the “multi-heads” represent a different “Conv” layer? Within each conv layer or self-attention head, is each (q, k, A) tuple equivalent to one of the Nc filters in a given convolution layer? In that case, is Nc = the number of words in the input sentence?
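And this is how I picture multi-head attention building on the sketch above: each head has its own W_q, W_k, W_v, so it can learn a different “question” about each word (the list-of-heads structure and the final W_o projection are my assumptions):

```python
def multi_head_attention(X, heads, W_o):
    """heads: a list of (W_q, W_k, W_v) tuples, one per head.
    My reading: each head learns a different "question" about each word."""
    A_per_head = [attention_all_words(X, W_q, W_k, W_v)
                  for (W_q, W_k, W_v) in heads]
    return np.concatenate(A_per_head, axis=1) @ W_o  # concatenate heads, then project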
If I misunderstood the intuition behind the self-attention and multi-head steps, please clarify.
–Rahul