C5W4 - In need of attention regarding 'multi-head attention'

So, I must confess: the models here (unless it is just me) have started to become really complicated.

I guess I can generally understand Prof. Ng’s analogy to the database model regarding Q, K, V.

Yet what I find especially confusing is how Q is determined. I mean, yes, I could see it working if you could somehow manually specify an appropriate query for every single word. But my understanding is that the model itself, through training, is supposed to determine that.

Though the level of depth and understanding the questions (or queries) he suggests would require almost seems to assume we already ‘know’ what the sentence actually means in the first place, which, like the ‘chicken and egg’ problem, seems highly self-referential.

Nor does it seem as if we have quite enough knowledge from the word embeddings alone to determine this…

So I am a bit confused.

*This would seem to get even more complicated if your ‘token’ is not even a complete word.

You’re definitely not wrong that the water is getting pretty deep here. :grinning:

Yes, the word embeddings alone are not enough. We have training data, right? With the input sentences and expected output sentences (assuming the translation use case). That’s what the model learns from.
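To make the "training determines it" point concrete: nobody hand-writes a query per word. Each token's query is just a learned linear projection of its embedding, Q = X · W_Q, and the projection matrices W_Q, W_K, W_V are ordinary trainable weights updated by backprop along with everything else. A minimal NumPy sketch (toy sizes and random weights standing in for trained ones, not code from the course):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

seq_len, d_model = 4, 8                  # toy sizes: 4 tokens, 8-dim embeddings
X = rng.normal(size=(seq_len, d_model))  # word embeddings for one sentence

# These projection matrices are the trainable parameters; training
# (backprop through the whole network) is what "chooses" the queries.
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

Q = X @ W_Q   # each token's query is a learned function of its embedding
K = X @ W_K
V = X @ W_V

scores = Q @ K.T / np.sqrt(d_model)   # scaled dot-product similarities
weights = softmax(scores, axis=-1)    # each row sums to 1
out = weights @ V                     # attention output, one vector per token
print(out.shape)  # (4, 8)
```

So the "question" a token asks is never written down explicitly; it is whatever direction in embedding space W_Q has learned to be useful for the end task (e.g. translation loss).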


This course is not easy to completely understand by watching the videos alone. In fact, I had to watch the videos, do the exercises, and search documentation and other external materials as well. One picture that helped me understand self-attention is the one below. It clarified a lot of doubts that I had. I hope it helps.


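For what it's worth, once single-head self-attention clicks, the "multi-head" part is a small step: run several attentions in parallel on lower-dimensional slices of Q, K, V, then concatenate the results and mix them with one more learned matrix (often called W_O). A hedged NumPy sketch of that idea (random weights and names like `W_O` are illustrative, not from the course notebooks):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, params, n_heads):
    """Toy multi-head self-attention: one scaled dot-product attention
    per head on a slice of the projected vectors, then concatenate the
    head outputs and mix them with W_O."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q = X @ params["W_Q"]
    K = X @ params["W_K"]
    V = X @ params["W_V"]
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, sl])
    return np.concatenate(heads, axis=-1) @ params["W_O"]

rng = np.random.default_rng(1)
d_model = 8
params = {k: rng.normal(size=(d_model, d_model))
          for k in ("W_Q", "W_K", "W_V", "W_O")}
X = rng.normal(size=(5, d_model))  # 5 tokens
print(multi_head_attention(X, params, n_heads=2).shape)  # (5, 8)
```

Each head ends up free to learn a different "question" (syntax, coreference, position, etc.), which is the intuition behind having more than one.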
@Weberson_Pontes thanks, can you provide a reference as to where you got this? I know of course there is the seminal paper, but I have also been searching for texts. I mean, Ng’s lecture does make the paper ‘readable’. My traditional ‘go to’, Goodfellow et al.’s ‘Deep Learning’, unfortunately came out in 2016 and doesn’t seem to have anything on attention, even at the pre-transformer level.

And there are enough TF references out there now if you want to just build the thing… But I still need to ‘hand hold’ myself through the conceptual understanding, I think…

I was doing some research and found this image by chance.

The figure comes from this Wikipedia article:
