You can interpret each linear transformation as a form of communication between tokens.
When transforming the embeddings (x) into Q, K, and V, we can loosely understand each token as asking specific questions:
- query - What am I looking for?
- key - What do I have?
- value - What can I offer or contribute to the aggregation?
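The three questions above can be sketched in a few lines of numpy. This is a minimal, illustrative version of scaled dot-product attention: the random matrices stand in for learned weights, and the shapes and token count are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 4, 8                  # e.g. 4 tokens, 8-dim embeddings (illustrative)
x = rng.normal(size=(seq_len, d_model))  # token embeddings

# One learned linear map per role; random stand-ins for trained weights
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = x @ W_q  # query: "What am I looking for?"
K = x @ W_k  # key:   "What do I have?"
V = x @ W_v  # value: "What can I offer?"

# Each query scores every key; softmax turns scores into weights;
# the weights then aggregate the values
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V  # each row is the value aggregation for one token
```

Each row of `weights` sums to 1, so every token's output is a weighted average of the values it attends to.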
In a multi-head attention mechanism, each head has its own set of weights, enabling it to ask these questions independently and specialize in a different aspect of the input. This parallelism lets the model explore the data from several perspectives at once.
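To make the "own set of weights per head" idea concrete, here is a minimal numpy sketch of multi-head attention. As before, the random matrices are stand-ins for learned weights, and the dimensions are chosen only for illustration; real implementations typically batch the heads into a single tensor operation rather than looping.

```python
import numpy as np

rng = np.random.default_rng(1)

seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads              # each head works in a smaller subspace
x = rng.normal(size=(seq_len, d_model))  # token embeddings

def attention(Q, K, V):
    """Scaled dot-product attention over one head."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Each head gets its OWN projection weights, so each one
# asks its questions in a different learned subspace
head_outputs = []
for h in range(n_heads):
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    W_v = rng.normal(size=(d_model, d_head))
    head_outputs.append(attention(x @ W_q, x @ W_k, x @ W_v))

# Head outputs are concatenated, then mixed by a final output projection
W_o = rng.normal(size=(d_model, d_model))
out = np.concatenate(head_outputs, axis=-1) @ W_o
```

Because the heads never share projection weights, the same token embeddings land in different places in each head's space, which is exactly what the illustration below shows.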
For example, in the provided illustration, the original token embeddings appear similar since they are positioned closely together. However, when each head transforms them, they move to different spaces determined by that particular head’s focus.
For instance, in Head 1, the query for “du” shifts to the left and upwards. As a result of this transformation, “du” becomes more similar to “tea”, whereas in the original space it was closer to “for”. In this instance, Head 1 therefore attends to the tokens “tea” and “it’s” and aggregates their corresponding values, so in Head 1 “du” would be represented as a purple star:
In Head 2 of the illustration, the queries and keys exhibit significant misalignment, making it difficult to discern what Head 2 is specifically looking for in relation to “du” and what it will aggregate as a result.
In summary, the advantage of employing multiple heads instead of a single head is that each head can explore a different aspect of the input, condensed into a smaller question and answer. Together, the heads provide a more comprehensive analysis than a single, overarching question and answer could.
Cheers