Multi-head attention

week 1.
Question: how is multihead attention achieved - is it a parallel computing thing or totally different?

Multi-head attention in models like the Transformer is indeed achieved through a form of parallel computation, but it's not just about parallelism: it also involves a distinct mechanism that lets the model learn different types of relationships within the data.

You start from the scaled dot-product attention equation:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where Q, K, and V are the query, key, and value projections of the input, and d_k is the dimension of the keys (used for scaling). These projections are linear transformations computed with learned weight matrices W^Q, W^K, and W^V.
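As a rough sketch of how the formula maps to code (plain NumPy, a single sequence, no masking or batching), it might look like this:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k), V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # attention weights per query position
    return weights @ V                   # (seq_len, d_v) weighted sum of values
```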

Multi-head attention extends this idea by performing multiple attention computations in parallel. Instead of a single set of Q, K, and V, the input is projected into multiple sets, one per head. Each head learns to attend to different parts of the input, capturing different relationships or patterns. The head outputs are then concatenated and projected back to the model dimension, as in the sketch below.
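Here is a minimal sketch of how the heads fit together, reusing `scaled_dot_product_attention` from the snippet above (the names `W_Q`, `W_K`, `W_V`, `W_O` and the shapes are illustrative, not a specific library's API):

```python
def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # X: (seq_len, d_model)
    # W_Q, W_K, W_V: lists of (d_model, d_k) matrices, one per head
    # W_O: (num_heads * d_k, d_model) output projection
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv              # per-head linear projections
        heads.append(scaled_dot_product_attention(Q, K, V))
    concat = np.concatenate(heads, axis=-1)           # (seq_len, num_heads * d_k)
    return concat @ W_O                               # back to (seq_len, d_model)
```

In real implementations the per-head projections are usually fused into single large weight matrices so all heads are computed with one big matrix multiplication, but the result is the same.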


Thank you. So it could be done serially as well, just much slower, right?

In theory you could compute the attention heads one after another in sequence instead of in parallel, and the final result would be the same. However, doing so would be much less efficient. The parallel computation isn't just for speed; it's also designed so that each head can learn different aspects of the input simultaneously.
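To make that concrete, here is a small sketch (continuing the NumPy functions above; the sizes are arbitrary) showing that a serial loop over heads and a single batched computation produce identical numbers:

```python
rng = np.random.default_rng(0)
seq_len, d_model, num_heads, d_k = 5, 16, 4, 4
X = rng.normal(size=(seq_len, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]

# Serial: compute one head at a time
serial = [scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
          for Wq, Wk, Wv in zip(W_Q, W_K, W_V)]

# "Parallel": stack the heads into one (num_heads, seq_len, d_k) batch
Qs = np.stack([X @ Wq for Wq in W_Q])
Ks = np.stack([X @ Wk for Wk in W_K])
Vs = np.stack([X @ Wv for Wv in W_V])
scores = Qs @ Ks.transpose(0, 2, 1) / np.sqrt(d_k)
batched = softmax(scores, axis=-1) @ Vs

assert np.allclose(np.stack(serial), batched)  # same result, different scheduling
```

The mathematics is identical either way; the batched form simply lets the hardware process all heads at once.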