In the self-attention discussion it is mentioned that Q = Wx, presumably a matrix product of W and x, where W is W^Q (W superscript Q).
In the multi-head discussion the term W^Q (W superscript Q) comes up again. Is the W in that presentation just the inverse of the W matrix presented in self-attention?
i.e. multiply Q = Wx on the left by Winv (often written as W^-1),
leading to
Winv Q = x, since Winv W = I (the identity matrix),
i.e. more specifically since np.dot(Winv, W) gives the identity matrix np.eye(n).
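To make that concrete, here is a quick NumPy check of the inversion step I mean. It assumes W is square and invertible (the names and sizes are just illustrative; in the transformer the projection matrices such as W^Q are generally rectangular, so an exact inverse need not exist):

import numpy as np

rng = np.random.default_rng(0)
d = 4
W = rng.standard_normal((d, d))   # stand-in for W, assumed square and invertible
x = rng.standard_normal(d)        # stand-in for an input vector

Q = np.dot(W, x)                  # Q = Wx
W_inv = np.linalg.inv(W)          # Winv, i.e. W^-1

print(np.allclose(np.dot(W_inv, W), np.eye(d)))  # Winv W = I
print(np.allclose(np.dot(W_inv, Q), x))          # Winv Q = x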
Hi Tom,
I just surmised they might be, since multiplying by the inverse would eliminate W from the right-hand side of the equation, leaving the inverse simply multiplying Q on the left-hand side.
Thanks, David