Hello,
I have found something of an inconsistency between video 2 (Self-Attention) and video 3 (Multi-Head Attention) of the course, which makes me wonder which equation is actually correct.
In video 2, at 4:43, the equations involving W^Q (I apologize for the somewhat sloppy notation) are:
(1) q^<3> = W^Q x^<3>, and likewise k^<3> = W^K x^<3> and v^<3> = W^V x^<3>,
which define the matrices W^Q, W^K and W^V for the first time. However, in video 3, at 1:27 (and, I assume, throughout the rest of the video), we apparently have (the equation is not written out explicitly):
(2) q^<1> = W_1^Q q^<1>, and so on,
so there is now a new matrix W_1^Q, which seems to be very different from W^Q in video 2. One explanation could be to replace (2) with:
(3) x^<1> = W_1^Q q^<1>, x^<1> = W_1^K k^<1>, and similarly for W_1^V,
and still:
(4) x^<2> = W_1^Q q^<2>, and so on.
My main (related) questions are:
– Are equations (3) and (4) close to correct?
– If not, how do we define W_1^Q, W_1^K, W_1^V and, in general, W_j^Q, W_j^K, W_j^V?
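To pin down what I am asking, here is a minimal NumPy sketch of equation (1), followed by a per-head matrix W_1^Q applied to the resulting q^<1>, which is how the expression in video 3 reads if taken at face value. All sizes, the random matrices and the variable names are made up for illustration and are not taken from the course; the last lines only check the linear-algebra fact that applying W_1^Q after W^Q is the same as applying one combined matrix directly to x^<1>.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not from the course).
T, d_model, d_k, d_head = 5, 4, 3, 2

# Rows of X are the inputs x^<1>, ..., x^<T>.
X = rng.normal(size=(T, d_model))

# Equation (1), video 2: q^<t> = W^Q x^<t> (k and v are analogous).
W_Q = rng.normal(size=(d_k, d_model))
Q = X @ W_Q.T  # row t-1 holds q^<t>

# Video 3 taken at face value: a head-1 matrix applied to q^<1>.
W1_Q = rng.normal(size=(d_head, d_k))
head1_q1 = W1_Q @ Q[0]

# The same vector, obtained from x^<1> with a single combined matrix.
W1_Q_combined = W1_Q @ W_Q
head1_q1_from_x = W1_Q_combined @ X[0]

assert np.allclose(head1_q1, head1_q1_from_x)
```

If this reading is right, W_1^Q would be an additional projection on top of W^Q rather than a replacement for it, but that is exactly the part I am unsure about.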
Thank you for your help.