Multi-Head Attention
Even though this is explained in a note at 7:10 of the video, I still cannot understand why, in multi-head attention, we compute W_i^Q * q^<1> for each head (each self-attention unit), whereas in the previous video on self-attention we computed q^<3> = W^Q * x^<3>. The explanation in the additional note is not clear to me.
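To make the two expressions I am comparing concrete, here is a minimal NumPy sketch of how I currently read them. This is only an illustration of the question, not the course's actual code; the matrix names (W_Q, W_i_Q) and the dimensions (d_model, d_q, d_head, num_heads) are assumptions I made up for the example.

```python
import numpy as np

# Illustrative dimensions only (not from the course).
d_model, d_q, d_head, num_heads = 8, 8, 4, 2
rng = np.random.default_rng(0)

x1 = rng.standard_normal(d_model)             # word embedding x^<1>

# Self-attention video: one query per word, q^<1> = W^Q x^<1>
W_Q = rng.standard_normal((d_q, d_model))
q1 = W_Q @ x1

# Multi-head attention video (as written in the note): each head i has its
# own matrix W_i^Q, which is applied to the already-computed q^<1>
W_i_Q = [rng.standard_normal((d_head, d_q)) for _ in range(num_heads)]
head_queries = [W @ q1 for W in W_i_Q]        # W_i^Q q^<1> for each head

print(q1.shape, [h.shape for h in head_queries])
```

My confusion is exactly about this difference: in the first case the query is produced directly from the embedding x^<1>, while in the second case the per-head matrices seem to act on q^<1> itself.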