Multi-Head Attention
Even though this is explained in a note at 7:10 of the video, I still cannot understand why, in multi-head attention, we compute W_i^Q * q^<1> for each head (each self-attention unit), whereas in the previous video on self-attention we computed q^<3> = W^Q * x^<3>. The explanation in the additional note is not clear to me.
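To make the two expressions I am comparing concrete, here is a minimal NumPy sketch of how I currently read them. This is only an illustration of the question, not the course's actual code; the matrix names (W_Q, W_i_Q) and the dimensions (d_model, d_q, d_head, num_heads) are assumptions I made up for the example.

```python
import numpy as np

# Illustrative dimensions only (not from the course).
d_model, d_q, d_head, num_heads = 8, 8, 4, 2
rng = np.random.default_rng(0)

x1 = rng.standard_normal(d_model)             # word embedding x^<1>

# Self-attention video: one query per word, q^<1> = W^Q x^<1>
W_Q = rng.standard_normal((d_q, d_model))
q1 = W_Q @ x1

# Multi-head attention video (as written in the note): each head i has its
# own matrix W_i^Q, which is applied to the already-computed q^<1>
W_i_Q = [rng.standard_normal((d_head, d_q)) for _ in range(num_heads)]
head_queries = [W @ q1 for W in W_i_Q]        # W_i^Q q^<1> for each head

print(q1.shape, [h.shape for h in head_queries])
```

My confusion is exactly about this difference: in the first case the query is produced directly from the embedding x^<1>, while in the second case the per-head matrices seem to act on q^<1> itself.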