Multi-Head Attention - Question about the slide

Multi-Head Attention | Coursera

Although it is explained in a note at 7:10 of the video, I still cannot understand why, in multi-head attention, we use (Wi_Q)(q<1>) for each self-attention unit, whereas in the previous video on self-attention we used (q<3>) = (W_Q)(X<3>). The explanation in the additional note is not clear to me.



Hello @haleh,

I think they have used the symbols in two different ways, because in this part of your screenshot,

image

they evidently take q = x, which is not the case in the other video, where (q<3>) = (W_Q)(X<3>) is already the projected query.

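I would just take it as two different symbol definitions: on the multi-head slide, (Wi_Q)(q<1>) plays the role that (W_Q)(X<1>) played in the self-attention video, with a separate matrix Wi_Q for each head i.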
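
If it helps, here is a small NumPy sketch of that reading. It is only an illustration under my assumption that q = x on the multi-head slide; the dimensions and variable names (d_model, d_head, n_heads, W_Q_heads) are made up for the example and are not taken from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads = 8, 4, 2   # hypothetical sizes, for illustration only

x1 = rng.normal(size=d_model)        # embedding of the first word, X<1>

# Self-attention video: the query is the projected input, q<1> = W_Q X<1>.
W_Q = rng.normal(size=(d_head, d_model))
q1_single = W_Q @ x1

# Multi-head slide (assumed reading): the symbol q<1> is the input itself.
q1_slide = x1

# One projection matrix Wi_Q per head; head 0 reuses W_Q for comparison.
W_Q_heads = rng.normal(size=(n_heads, d_head, d_model))
W_Q_heads[0] = W_Q

# Each head i computes Wi_Q q<1>, which is Wi_Q X<1> under this reading.
queries_per_head = np.stack([W_Q_heads[i] @ q1_slide for i in range(n_heads)])

print(queries_per_head.shape)
print(np.allclose(queries_per_head[0], q1_single))
```

With the same matrix, the head's (Wi_Q)(q<1>) matches the single-head (W_Q)(X<1>), so the two slides describe the same computation and differ only in what the letter q refers to.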

Cheers,
Raymond