C5W4 - In need of attention regarding 'multi-head attention'

So, I must confess: the models here (unless it is just me) have started to become really complicated.

I guess I can generally understand Prof. Ng’s analogy to the database model regarding Q, K, V.

Yet what I find especially confusing is how Q is determined. I mean, yes, I could see it working if you could somehow manually specify an appropriate query for every single word. But my understanding is that the model itself, through training, is supposed to determine that.

Though the level of depth and understanding the questions (or queries) he suggests would require almost seems to assume we already ‘know’ what the sentence actually means in the first place, which, like the ‘chicken and egg’ problem, seems highly self-referential.

Nor does it seem as if we have quite enough knowledge from the word embeddings alone to determine this…

So I am a bit confused.

*This would seem to get even more complicated if your ‘token’ is not even a complete word.

You’re definitely not wrong that the water is getting pretty deep here. :grinning:

Yes, the word embeddings alone are not enough. We have training data, right? With the input sentences and expected output sentences (assuming the translation use case). That’s what the model learns from.
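To make the "training determines it" point concrete: nobody hand-writes a query per word. Each token's query is just a learned linear projection of its embedding, Q = X · W_Q, and the projection matrices W_Q, W_K, W_V are ordinary trainable weights updated by backprop along with everything else. A minimal NumPy sketch (toy sizes and random weights standing in for trained ones, not code from the course):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

seq_len, d_model = 4, 8                  # toy sizes: 4 tokens, 8-dim embeddings
X = rng.normal(size=(seq_len, d_model))  # word embeddings for one sentence

# These projection matrices are the trainable parameters; training
# (backprop through the whole network) is what "chooses" the queries.
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

Q = X @ W_Q   # each token's query is a learned function of its embedding
K = X @ W_K
V = X @ W_V

scores = Q @ K.T / np.sqrt(d_model)   # scaled dot-product similarities
weights = softmax(scores, axis=-1)    # each row sums to 1
out = weights @ V                     # attention output, one vector per token
print(out.shape)  # (4, 8)
```

So the "question" a token asks is never written down explicitly; it is whatever direction in embedding space W_Q has learned to be useful for the end task (e.g. translation loss).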


This course is not easy to completely understand by watching the videos alone. In fact, I had to watch the videos, do the exercises, and search documentation and other external materials as well. One picture that helped me understand self-attention is the one below. It clarified a lot of doubts that I had. I hope it helps.


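For what it's worth, once single-head self-attention clicks, the "multi-head" part is a small step: run several attentions in parallel on lower-dimensional slices of Q, K, V, then concatenate the results and mix them with one more learned matrix (often called W_O). A hedged NumPy sketch of that idea (random weights and names like `W_O` are illustrative, not from the course notebooks):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, params, n_heads):
    """Toy multi-head self-attention: one scaled dot-product attention
    per head on a slice of the projected vectors, then concatenate the
    head outputs and mix them with W_O."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q = X @ params["W_Q"]
    K = X @ params["W_K"]
    V = X @ params["W_V"]
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, sl])
    return np.concatenate(heads, axis=-1) @ params["W_O"]

rng = np.random.default_rng(1)
d_model = 8
params = {k: rng.normal(size=(d_model, d_model))
          for k in ("W_Q", "W_K", "W_V", "W_O")}
X = rng.normal(size=(5, d_model))  # 5 tokens
print(multi_head_attention(X, params, n_heads=2).shape)  # (5, 8)
```

Each head ends up free to learn a different "question" (syntax, coreference, position, etc.), which is the intuition behind having more than one.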
@Weberson_Pontes thanks, can you provide a reference as to where you got this? I know of course there is the seminal paper, but I have also been searching for texts. I mean, Ng’s lecture does make the paper ‘readable’. My traditional ‘go to’, Goodfellow et al.’s ‘Deep Learning’, unfortunately came out in 2016 and doesn’t seem to have anything on attention, even at the pre-transformer level.

And there are enough TF references out there now if you want to just build the thing… But I still need to ‘hand hold’ myself through the conceptual understanding, I think…

I was doing some research and found this image by chance.

The figure comes from this Wikipedia article:
