Notation inconsistency between videos 2 and 3 in Transformers

Catalin_Sapariuc · August 23, 2022, 1:44pm

Hello,
I have found an inconsistency between video 2 (Self-Attention) and video 3 (Multi-Head Attention) of the course, which makes me wonder which equation is actually correct.
In video 2, at 4:43, the equations involving W^Q (I apologize for the somewhat sloppy notation) are:
(1) q^<3> = W^Q x^<3>, k^<3> = W^K x^<3>, v^<3> = W^V x^<3>
These define the matrices W^Q, W^K and W^V for the first time.
However, in video 3, at 1:27 (and, I assume, throughout the rest of the video), we apparently have (the equation is not written explicitly):
(2) q^<1> = W_1^Q q^<1>, and so on
So there is now a new matrix W_1^Q, which seems to be very different from W^Q in video 2. One explanation could be to replace (2) with:
(3) x^<1> = W_1^Q q^<1>, x^<1> = W_1^K k^<1>, and similarly for W_1^V
and likewise:
(4) x^<2> = W_1^Q q^<2>, and so on
My main (related) questions are:
– Are equations (3)-(4) close to correct?
– If not, how do we define W_1^<Q, K, V>, and in general W_j^<Q, K, V>?

Thank you for your help.

TMosh · August 24, 2022, 11:13pm

This issue is under open discussion among the course staff, and an update to the course materials is possible.

I believe the equations in the Self-Attention video are more correct than the ones in the MHA video.
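
For reference, here is a minimal NumPy sketch of the formulation used in the original Transformer paper, where each head i applies its own matrices directly to the inputs, i.e. q_i^<t> = W_i^Q x^<t> (and likewise for k and v). With a single head this reduces to equation (1) from the Self-Attention video. The function names, shapes, and random weights below are illustrative assumptions, not the course's code, and the course slides may use a different convention.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention; each row of Q, K, V is one position.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (T, T)
    return softmax(scores, axis=-1) @ V      # (T, d_head)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    # X:   (T, d_model), row t is x^<t>
    # W_q, W_k, W_v: (h, d_head, d_model), one matrix per head (W_i^Q etc.)
    # W_o: (d_model, h * d_head), output projection
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q_i = X @ Wq_i.T   # row t is q_i^<t> = W_i^Q x^<t>
        K_i = X @ Wk_i.T   # row t is k_i^<t> = W_i^K x^<t>
        V_i = X @ Wv_i.T   # row t is v_i^<t> = W_i^V x^<t>
        heads.append(attention(Q_i, K_i, V_i))
    concat = np.concatenate(heads, axis=-1)  # (T, h * d_head)
    return concat @ W_o.T                    # (T, d_model)

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
T, d_model, h, d_head = 5, 8, 2, 4
X = rng.normal(size=(T, d_model))
W_q = rng.normal(size=(h, d_head, d_model))
W_k = rng.normal(size=(h, d_head, d_model))
W_v = rng.normal(size=(h, d_head, d_model))
W_o = rng.normal(size=(d_model, h * d_head))

out = multi_head_self_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 8)
```

In this convention there is no separate W^Q applied before the per-head W_i^Q: every head projects x^<t> directly, so the W_i^<Q, K, V> play the same role per head that W^<Q, K, V> play in the single-head case.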
