Multi-headed Attention: the mathematical meaning

From the lecture “Multi-head Attention”, I learned that the input embedding is transformed linearly into different representations. I don’t understand why this is beneficial compared to a single head. I guess more representations of the embeddings mean better precision, but what’s the mathematical way of saying that? This question might be related to this question.

Hi @Yuncheng_Liao

You can interpret each linear transformation as a form of communication between tokens.
When transforming the embeddings (x) into Q, K, and V, we can loosely understand each token as asking specific questions:

  • query - What am I looking for?
  • key - What do I have?
  • value - What can I offer or contribute to the aggregation?

In a multi-head attention mechanism, each head has its own set of weights, enabling it to ask these questions independently and specialize in different aspects. This parallelism allows for a diversified exploration of the input data.
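As a rough sketch of what that means in code (toy NumPy with my own variable names, using d_model = 512 and 8 heads as in the original paper, not the course implementation), each head owns its own W_q, W_k and W_v, so the same embeddings land in a different Q/K/V space per head:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 512, 8
d_k = d_model // n_heads                  # 64: each head works in a smaller space

x = rng.normal(size=(seq_len, d_model))   # token embeddings

head_outputs = []
for _ in range(n_heads):
    # every head has its OWN projection matrices, so it asks its own questions
    W_q = rng.normal(size=(d_model, d_k))
    W_k = rng.normal(size=(d_model, d_k))
    W_v = rng.normal(size=(d_model, d_k))

    Q, K, V = x @ W_q, x @ W_k, x @ W_v         # (seq_len, d_k) each
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # who attends to whom, per this head
    head_outputs.append(weights @ V)            # aggregate the values

# concatenate the heads back to d_model so the residual connection lines up
out = np.concatenate(head_outputs, axis=-1)     # (seq_len, 512)
print(out.shape)
```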

For example, in the provided illustration, the original token embeddings appear similar since they are positioned closely together. However, when each head transforms them, they move to different spaces determined by that particular head’s focus.
For instance, in Head 1, the query “du” shifts to the left and upwards. As a result of this transformation, “du” becomes more similar to “tea” than to “for”, unlike in the original space. Therefore, in this instance, Head 1 attends to the tokens “tea” and “it’s” and aggregates their corresponding values. Consequently, in Head 1, “du” would be represented as a purple star.

In Head 2 of the illustration, the queries and keys exhibit significant misalignment, making it difficult to discern what Head 2 is specifically looking for in relation to “du” and what it will aggregate as a result.

In summary, the advantage of employing multiple heads instead of a single head lies in the ability to explore different aspects of the input data, condensed into smaller questions and answers. This facilitates a more comprehensive analysis compared to relying on a single, overarching question and answer.

Cheers

Since the original d_model dimensions are divided among the h heads, each of the linear transformations is a projection into a smaller-dimensional space. Neither the lecture nor the response above emphasizes this point; the lecture frames the reduction in dimension of the linear transformations simply as a way to ensure that multi-head attention doesn’t require h times as much computation.
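As a back-of-the-envelope check of that computation point (d_model = 512 and h = 8 from the original paper; this is only my own arithmetic sketch):

```python
d_model, h = 512, 8
d_k = d_model // h                      # 64: each head projects into this smaller space

# per head, the Q/K/V projections are each d_model x d_k (512 x 64)
params_per_head = 3 * d_model * d_k
params_all_heads = h * params_per_head

# a single head with full-size 512 x 512 projections would cost exactly the same
params_single_full_head = 3 * d_model * d_model
print(params_all_heads == params_single_full_head)   # True: splitting is "free"
```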

Is the fact that the transformations are projections essential to ensuring that the different heads learn different aspects of the relationship between the sequences (or the elements of the same sequence, in the case of self-attention)? Or would we still get specialized heads if each linear transformation were a full d_model x d_model matrix without projection?

If the projection is essential, it seems that should be discussed in the lecture (or maybe an optional video/reading could be added?)

PS: the claim that the linear transformation changes the distances between words in the embeddings also seems to depend on the fact that the illustrations are projections (here 2-d projections) of the original embedding space.

Hi @David_Fox

I’m sorry I missed your good questions probably because you did not mention my handle and I was on vacation at the time.

Yes, the point is not emphasized in the lectures. In order for the Transformer architecture to work with residual connections, the output of the attention block has to match the dimensions of the previous block. So, as you correctly ask, if each head worked with full dimensions rather than reduced ones, the question would arise: when do we reduce them? The obvious place would be the position-wise Feed Forward Network, but that is where the increased computation costs would come in, and are they worth it?

I think that, through a myriad of experiments with different architectures for the block, the authors arrived at the current design (which essentially nobody has improved on, maybe except for the placement of the layer norm): each head operates with reduced dimensions.
Each head could operate with full dimensions, but my intuition is that the additional computation cost is less worthwhile than having more heads with reduced dimensions, or having more layers. In other words, specialization in different aspects of language presumably does not require an equally big space for each head; rather, each head “picks” what is interesting to it in order to output its expertise.
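As a rough illustration of that trade-off (toy arithmetic with d_model = 512 and h = 8; the full-dimension variant below is just the hypothetical case being discussed, not anything from the paper):

```python
d_model, h = 512, 8
d_k = d_model // h

# current design: heads project 512 -> 64, so the concatenation is 8 * 64 = 512,
# which plugs straight into the residual connection
concat_reduced = h * d_k                 # 512

# hypothetical design: each head keeps the full 512 dimensions
concat_full = h * d_model                # 4096: has to be reduced back somewhere,
# e.g. by an output matrix of 4096 x 512 instead of the usual 512 x 512
print((concat_full * d_model) / (d_model * d_model))   # 8.0 times more parameters
```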

And yes, the linear transformation (projection) is essential, be it to full dimensions or reduced ones, in order for each head to “specialize”.

I’m not sure I fully understand this question. In this picture the projections use the same dimensions (2-d in the original and 2-d for each head), so it is not exactly what happens in the attention block; it is just a simple illustration of how linear transformations could move the original embeddings around for each head, so that it can specialize in whatever aspect it cares about.
In other words, the same “moving around” would happen in higher dimensions. For example, going from the 512-dimensional original space to a 64-dimensional head space, each head would produce a different chart (3 charts) in 64-dimensional space, while the original embeddings would stay the same in 512-dimensional space.

Cheers

Here is my understanding, and I’m not sure if it is correct: given that the multi-head attention is trained together with the embedding, the model naturally partitions the embedding vector into n_head sections; each section undergoes a different transformation and is then pieced back together in order to yield the best attention outcomes. Without multi-head attention, the Q*K variance-covariance matrix would largely be driven by the embedding alone. So the multi-head operation just gives the model one more layer of matrix operations (additional flexibility, or more parameters to fit) in order to extract the attention more effectively.
Intuitively, does the attention look at what has been translated and what has not been translated and, given the input and current output, find which words in the input we should focus on translating for the next step? Knowing which words in the input to focus on, the decoder then generates the next word, based on multiple normalization/feed-forward layers. I am not sure what the feed-forward does exactly and would appreciate some insight.

I don’t think that you understand it correctly. The model does not “naturally partition the embedding vector into n_head sections”, if I understand you correctly.
In other words, each head takes the whole embedding vector as input. To be concrete, if the embedding layer has 512 dimensions (units), then each head gets a 512-dimensional vector (not a 64-dimensional one, or a certain slice of the embedding vector).
What each head does is transform this 512-dimensional vector into 64-dimensional Q, K, and V (with n_head = 8, 64 = 512 / 8) and operate in that lower-dimensional space. In other words, each head moves those dots around (as in the original picture) in a 64-dimensional space to extract useful “attentions”, or representations to focus on.
Without multi-head attention there would be one head which would try to extract useful representations “all at once” (in 512-dimensional space), or, in other (loose) words, without subdividing (specializing in) what to focus on.
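A tiny shape sketch of that distinction (toy NumPy with my own names, not the assignment code): the head’s projection matrix is 512 x 64 and multiplies the whole 512-dimensional embedding; the head is not handed a pre-cut 64-dimensional slice of it.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k = 512, 64

x_token = rng.normal(size=(d_model,))       # one token's FULL embedding

W_q_head = rng.normal(size=(d_model, d_k))  # this head's query projection
q_head = x_token @ W_q_head                 # (64,): all 512 inputs contribute
                                            # to every one of the 64 outputs

q_slice = x_token[:d_k]                     # a fixed 64-dim slice: NOT what a head gets
print(q_head.shape, q_slice.shape)          # (64,) (64,) -- same size, different meaning
```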

In the original (translation) paper there are three attentions (orange blocks):

  • the leftmost is usually called “Self Attention”
  • the right-bottom is usually called “Causal Attention”
  • the right-top is usually called “Cross Attention”

So intuitively each attention “has different goals”. On top of that each head (in multi-head) focuses on its own specialty.
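Very loosely, all three orange blocks run the same attention computation and differ only in where the queries and the keys/values come from and whether a causal mask is applied. A schematic single-head sketch (my own toy code, ignoring the per-head projections):

```python
import numpy as np

def attention(q_src, kv_src, causal=False):
    """Toy single-head attention: one function, wired three different ways."""
    d = q_src.shape[-1]
    scores = q_src @ kv_src.T / np.sqrt(d)
    if causal:  # block attention to future positions
        scores += np.triu(np.full(scores.shape, -1e9), k=1)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv_src

rng = np.random.default_rng(2)
src = rng.normal(size=(6, 16))   # encoded source sentence (6 tokens)
tgt = rng.normal(size=(4, 16))   # target tokens generated so far

enc       = attention(src, src)                # leftmost: Self Attention
dec_self  = attention(tgt, tgt, causal=True)   # right-bottom: Causal Attention
dec_cross = attention(dec_self, enc)           # right-top: Cross Attention
print(enc.shape, dec_self.shape, dec_cross.shape)   # (6, 16) (4, 16) (4, 16)
```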

For example, you can imagine that in the leftmost attention (Self Attention) each head tries to encode/represent the sentence, preparing it for the right-top attention (Cross Attention) to consume (don’t forget the Feed Forward that follows). Intuitively (but very loosely), one head might specialize in gender (does “it” refer to something feminine/masculine/neuter, so that the German translation, via Cross Attention, can decide between “die / der / das”), while another might focus on “pluralness” (is “you” a single person, multiple people, or formal; “du / ihr / Sie” in German).

What each head does here can be interpreted as pulling attention toward some aspect of the English sentence and preparing it for the “Feed Forward” block (blue box, FFN) to consume. Usually, the FFN is a small multi-layer neural network (in the original paper, a fully connected two-layer feed-forward network with a 2048-dimensional inner layer).
Here (in the FFN), you can intuitively interpret the first layer as taking each head’s output as a slice of its input and “thinking/deciding” what to do with it.
In other words, the multi-head output is 512-dimensional (8 heads, each with a 64-dimensional output, 512 = 8 x 64), and the FFN “knows” that the first 64 numbers come from one head (which, loosely speaking, might specialize in gender), the next 64 numbers come from another head, and so on. The FFN can then decide (through the 2048-dimensional inner layer; note that its input and output are still 512-dimensional) what the best representation is for Cross Attention, so that its heads can work with it.
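A shape-level sketch of that step (dimensions from the original paper, the rest is my own toy code; note the paper also applies an output projection W_O to the concatenation before the FFN, which I skip here):

```python
import numpy as np

rng = np.random.default_rng(3)
n_heads, d_k, d_model, d_ff = 8, 64, 512, 2048

# pretend these are the 8 heads' outputs for a single token position
head_outputs = [rng.normal(size=(d_k,)) for _ in range(n_heads)]
concat = np.concatenate(head_outputs)      # (512,): head 0 fills slots 0-63, head 1 fills 64-127, ...

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

hidden  = np.maximum(0, concat @ W1 + b1)  # expand to 2048 with ReLU
ffn_out = hidden @ W2 + b2                 # back to 512 so the residual add still works
print(concat.shape, hidden.shape, ffn_out.shape)   # (512,) (2048,) (512,)
```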

One more aspect (maybe unnecessary, because it might confuse you further) that is not usually talked about is the residual connection.
Looking at the diagram above, you can form the view/intuition that all the black lines/arrows have the same width/importance. But in reality the picture should look more like this (where the residual lines would be thick):
(image: Transformer diagram redrawn with thick residual connections; source link in the original post)

In this view/intuition, the attention layers (with their FFNs) sit on the right side as “add-ons” (nonetheless very important/magical) to the massive original stream coming from the original embeddings. In other words, if you trace what happens to the original embedding numbers, they are “nudged” only slightly by each layer, but substantially enough to make a big difference.
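A toy way to see that “thick stream” intuition in code (purely schematic; the sublayer below is just a stand-in for an attention block or FFN, not the real thing):

```python
import numpy as np

def sublayer(x, scale=0.1):
    # stand-in for attention or FFN: contributes only a small correction
    return scale * np.tanh(x)

rng = np.random.default_rng(4)
x = rng.normal(size=(512,))              # original token embedding

stream = x
for _ in range(6):                       # 6 layers, as in the original encoder
    stream = stream + sublayer(stream)   # attention "nudge" + residual add
    stream = stream + sublayer(stream)   # FFN "nudge" + residual add

# the stream is still dominated by (a nudged version of) the original embedding
print(np.corrcoef(x, stream)[0, 1])      # stays close to 1.0
```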

Well, this answer is too long… :slight_smile: I think I should stop here :slight_smile:
Cheers

Wow, this is super helpful. Thanks a million for taking the time to explain it in an intuitive fashion. It is a much more informative explanation of the model than the videos, which are sometimes too short and simple.
Thanks again, Arvyzukai. You have made my day.
