Why is the mask size suggested to be (1 x L_q x L_q)? Isn’t the mask supposed to be (1 x L_q x L_v)? Can someone please clarify this?

Hi utkarsh_shukla2,

Have another look at the video ‘Masked Self Attention’. At 2:04 you see that the mask gets added to the dot product of the query and the transposed key, divided by the square root of the key’s encoding dimension. The shape of (Q\cdot{K^T})/\sqrt{d_k} is L_q by L_k, and in self-attention the queries and keys come from the same sequence, so L_q = L_k and the scores are L_q by L_q. Since the mask is added elementwise to these scores, it must have the same shape. The value matrix only enters after the softmax, which is why L_v never appears in the mask’s shape.
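
A minimal NumPy sketch (my own illustration, not the course code) showing the shapes involved. Here `L_q` and `d_k` are arbitrary example sizes; the causal mask is built with `-inf` above the diagonal so that masked positions get zero weight after the softmax:

```python
import numpy as np

L_q, d_k = 4, 8
rng = np.random.default_rng(0)

# In self-attention, Q and K come from the same sequence, so L_q == L_k.
Q = rng.standard_normal((L_q, d_k))
K = rng.standard_normal((L_q, d_k))

# Attention scores: (L_q x L_q), matching the mask's shape.
scores = Q @ K.T / np.sqrt(d_k)

# Causal mask: 0 where attention is allowed, -inf above the diagonal.
mask = np.triu(np.full((L_q, L_q), -np.inf), k=1)

# The mask adds elementwise to the scores, so shapes must match.
masked = scores + mask

# Softmax over the last axis; exp(-inf) = 0 zeroes out future positions.
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)

print(scores.shape, mask.shape)  # both (4, 4)
```

Note that nothing here depends on the values V: the mask’s shape is fixed entirely by the query/key score matrix.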