Multi-head attention

week 1.
Question: how is multihead attention achieved - is it a parallel computing thing or totally different?

Multi-head attention in models like the Transformer is indeed achieved through a form of parallel computation, but it's not just about parallelism: it also involves a distinct mechanism that lets the model learn different types of relationships within the data.

You start from the scaled dot-product attention equation:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where Q, K, and V are the query, key, and value projections of the input, and d_k is the dimension of the keys (used for scaling). These projections are linear transformations computed with learned weight matrices W^Q, W^K, and W^V.
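As a rough sketch of how the formula maps to code (plain NumPy, a single sequence, no masking or batching), it might look like this:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k), V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # attention weights per query position
    return weights @ V                   # (seq_len, d_v) weighted sum of values
```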

Multi-head attention extends this idea by performing multiple attention computations in parallel. Instead of a single set of Q, K, and V, the input is projected into multiple sets, one per head. Each head learns to attend to different parts of the input, capturing different relationships or patterns. The head outputs are then concatenated and projected back to the model dimension, as in the sketch below.
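Here is a minimal sketch of how the heads fit together, reusing `scaled_dot_product_attention` from the snippet above (the names `W_Q`, `W_K`, `W_V`, `W_O` and the shapes are illustrative, not a specific library's API):

```python
def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # X: (seq_len, d_model)
    # W_Q, W_K, W_V: lists of (d_model, d_k) matrices, one per head
    # W_O: (num_heads * d_k, d_model) output projection
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv              # per-head linear projections
        heads.append(scaled_dot_product_attention(Q, K, V))
    concat = np.concatenate(heads, axis=-1)           # (seq_len, num_heads * d_k)
    return concat @ W_O                               # back to (seq_len, d_model)
```

In real implementations the per-head projections are usually fused into single large weight matrices so all heads are computed with one big matrix multiplication, but the result is the same.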


Thank you. So it could be done serially as well, just much slower, right?

In theory you could compute the attention heads one after another in sequence instead of in parallel, and the final result would be the same. However, doing so would be much less efficient. The parallel computation isn't just for speed; it's also designed so that each head can learn different aspects of the input simultaneously.
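To make that concrete, here is a small sketch (continuing the NumPy functions above; the sizes are arbitrary) showing that a serial loop over heads and a single batched computation produce identical numbers:

```python
rng = np.random.default_rng(0)
seq_len, d_model, num_heads, d_k = 5, 16, 4, 4
X = rng.normal(size=(seq_len, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]

# Serial: compute one head at a time
serial = [scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
          for Wq, Wk, Wv in zip(W_Q, W_K, W_V)]

# "Parallel": stack the heads into one (num_heads, seq_len, d_k) batch
Qs = np.stack([X @ Wq for Wq in W_Q])
Ks = np.stack([X @ Wk for Wk in W_K])
Vs = np.stack([X @ Wv for Wv in W_V])
scores = Qs @ Ks.transpose(0, 2, 1) / np.sqrt(d_k)
batched = softmax(scores, axis=-1) @ Vs

assert np.allclose(np.stack(serial), batched)  # same result, different scheduling
```

The mathematics is identical either way; the batched form simply lets the hardware process all heads at once.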