You can interpret each linear transformation as a form of communication between tokens.
When transforming the embeddings (x) into Q, K, and V, we can loosely understand each token as asking specific questions:
- query - What am I looking for?
- key - What do I have?
- value - What can I offer or contribute to the aggregation?
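The three questions above can be sketched in a few lines of numpy. This is a minimal, illustrative version of scaled dot-product attention: the random matrices stand in for learned weights, and the shapes and token count are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 4, 8                  # e.g. 4 tokens, 8-dim embeddings (illustrative)
x = rng.normal(size=(seq_len, d_model))  # token embeddings

# One learned linear map per role; random stand-ins for trained weights
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = x @ W_q  # query: "What am I looking for?"
K = x @ W_k  # key:   "What do I have?"
V = x @ W_v  # value: "What can I offer?"

# Each query scores every key; softmax turns scores into weights;
# the weights then aggregate the values
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V  # each row is the value aggregation for one token
```

Each row of `weights` sums to 1, so every token's output is a weighted average of the values it attends to.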
In a multi-head attention mechanism, each head has its own set of weights, enabling it to ask these questions independently and specialize in a different aspect of the input. This parallelism lets the model explore the data from several perspectives at once.
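To make the "own set of weights per head" idea concrete, here is a minimal numpy sketch of multi-head attention. As before, the random matrices are stand-ins for learned weights, and the dimensions are chosen only for illustration; real implementations typically batch the heads into a single tensor operation rather than looping.

```python
import numpy as np

rng = np.random.default_rng(1)

seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads              # each head works in a smaller subspace
x = rng.normal(size=(seq_len, d_model))  # token embeddings

def attention(Q, K, V):
    """Scaled dot-product attention over one head."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Each head gets its OWN projection weights, so each one
# asks its questions in a different learned subspace
head_outputs = []
for h in range(n_heads):
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    W_v = rng.normal(size=(d_model, d_head))
    head_outputs.append(attention(x @ W_q, x @ W_k, x @ W_v))

# Head outputs are concatenated, then mixed by a final output projection
W_o = rng.normal(size=(d_model, d_model))
out = np.concatenate(head_outputs, axis=-1) @ W_o
```

Because the heads never share projection weights, the same token embeddings land in different places in each head's space, which is exactly what the illustration below shows.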
For example, in the provided illustration, the original token embeddings appear similar since they are positioned closely together. However, when each head transforms them, they move to different spaces determined by that particular head’s focus.
For instance, in Head 1, the query for “du” shifts to the left and upwards. As a result of this transformation, “du” becomes more similar to “tea”, whereas in the original space it was closer to “for”. In this instance, Head 1 therefore attends to the tokens “tea” and “it’s” and aggregates their corresponding values, so in Head 1 “du” would be represented as a purple star:
In Head 2 of the illustration, the queries and keys exhibit significant misalignment, making it difficult to discern what Head 2 is specifically looking for in relation to “du” and what it will aggregate as a result.
In summary, the advantage of employing multiple heads instead of a single head is that each head can explore a different aspect of the input, condensed into a smaller question and answer. Together, the heads provide a more comprehensive analysis than a single, overarching question and answer could.
Cheers