Week 4 Multi-Head Attention Video

What do i and j stand for in the Week 4 video (slide 7 of 11)?

The equation you’ve shown is the same as softmax(Q . K^T) . V, written out for a single query q.

First, apply softmax to the dot products of q with each key in K. The resulting weights tell you how strongly the current query term relates to each key.

Then multiply each softmax weight by the value vector at the same index in V and sum the results. That weighted sum is the same as the second dot product, the multiplication by V.

The only thing that might confuse you is the missing \frac{1}{\sqrt{d_k}} scaling term. The rest is correct.
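If it helps to see the equivalence concretely, here is a minimal sketch in plain Python. It is not taken from the course materials: the function names, the `scale` flag, and the toy Q, K, V values are all made up for illustration. It computes attention once as an explicit weighted sum for each query and once in the matrix form softmax(Q . K^T) . V, then checks that the two agree.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    # Plain matrix product of two lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention_one_query(q, K, V, scale=False):
    # Score the single query q against every key.
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    if scale:
        d_k = len(q)
        scores = [s / math.sqrt(d_k) for s in scores]
    weights = softmax(scores)
    # Weighted sum: each weight multiplies the value vector at the same index.
    out = [0.0] * len(V[0])
    for w, v in zip(weights, V):
        for c, v_c in enumerate(v):
            out[c] += w * v_c
    return out

def attention_matrix(Q, K, V, scale=False):
    # Matrix form: softmax(Q . K^T) . V, with softmax applied row by row.
    scores = matmul(Q, transpose(K))
    if scale:
        d_k = len(K[0])
        scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)

# Toy inputs (hypothetical): 2 queries, 3 keys/values, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.5, -1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

full = attention_matrix(Q, K, V, scale=True)
per_query = [attention_one_query(q, K, V, scale=True) for q in Q]
matches = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_full, row_one in zip(full, per_query)
    for a, b in zip(row_full, row_one)
)
print(matches)
```

Passing `scale=False` to both functions reproduces the unscaled version, and the two forms still agree; the \frac{1}{\sqrt{d_k}} factor only rescales the scores before the softmax.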
