Multi-Head Attention | Coursera

Although it is explained in a note at 7:10 of the video, I still cannot understand why multi-head attention uses (Wi_Q)(q<1>) for each self-attention unit, whereas the previous video on self-attention used (q<3>) = (W_Q)(X<3>). The explanation in the additional note is not clear to me.

Hello @haleh,

I think the two videos use the symbols in two different ways. In the part of the slide shown in your screenshot, they evidently take q = x, which is not the case in the other video.

I would just take it as two different symbol definitions.
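
To make the two conventions concrete, here is a small sketch in plain Python. It assumes my reading above is right, namely that in the multi-head video q<t> stands for the input itself (q = x) and each head i applies its own W_i^Q. The dimensions, the random values, and the helper names (`matvec`, `rand_matrix`) are made up for illustration and are not taken from the course.

```python
import random

random.seed(0)

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols):
    """Hypothetical helper: a rows x cols matrix of random weights."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes, chosen only for illustration.
d_model, d_head, seq_len, n_heads = 4, 2, 3, 2

# x[t] plays the role of x<t+1> (Python indexes from 0).
x = rand_matrix(seq_len, d_model)

# Convention A (self-attention video): q<t> is the PROJECTED vector,
# q<t> = W_Q x<t>.
W_Q = rand_matrix(d_head, d_model)
q_conv_a = [matvec(W_Q, x_t) for x_t in x]

# Convention B (multi-head video, as I read it): q<t> is the UNPROJECTED
# input (assumption: q = x), and head i computes W_i^Q q<t>.
q = x
W_Q_heads = [W_Q] + [rand_matrix(d_head, d_model) for _ in range(n_heads - 1)]
head_queries = [[matvec(W_i, q_t) for q_t in q] for W_i in W_Q_heads]

# If head 0 reuses W_Q, its queries are exactly the convention-A queries:
# the arithmetic is identical, only the meaning of the symbol "q" changed.
same = head_queries[0] == q_conv_a
print(same)
```

Under that assumption the computation is the same in both videos; the only difference is whether the letter q names the vector before or after the multiplication by the weight matrix.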