—Synopsis:
I have completed the first 3 courses of the NLP Specialization, and I am now in Course 4 (I have finished Module 1, NMT with attention: it is awesome!). It is a really very interesting Specialization. However, I detected contradictions in Module 2 of Course 4. I can speak about this with some confidence because I have already done the new Module 4, "Transformers", added by Andrew Ng, Younes and Kian to the updated Deep Learning Specialization (Course 5), I have spent a lot of time reading the paper "Attention Is All You Need", and I completed the final "Transformers" assignment of Course 5 with full marks… But maybe I am not right, or something is missing…
Thank you for enlightening me if I did not understand well (if I am wrong).
Below, I explain the contradictions that I detected in Module 2 of Course 4 of the NLP Specialization:
—First:
In Module 2 of Course 4 (Natural Language Processing with Attention Models), in the video called "Transformer Decoder" (the link is below):
Here is what I want to talk about (something is wrong in the decoder block):
By reading the paper "Attention Is All You Need" and comparing the image above (from Module 2 of the course) with the one from the paper (below), we can see that something is wrong in the decoder block of the course (see the two images of the decoder blocks).
In the course's explanation of the decoder block, it is never mentioned how the two outputs (K, V) from the encoder (the Key and Value coming from the encoder) are fed into the second MultiHeadAttention of the decoder. If we look at the decoder architecture shown in the course, it looks as if we are actually describing the encoder, just with a Linear layer and a Softmax layer added at the end (the inner block is that of the encoder, not of the decoder).
In fact, the decoder block has two MultiHeadAttention sub-layers (not only one, as explained in the course video "Transformer Decoder"), as sketched in the code after this list:
- The first MultiHeadAttention (decoder self-attention)
- The second MultiHeadAttention (encoder-decoder attention, where the Key and Value come from the encoder and the Query comes from the decoder)
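To make the point concrete, here is a minimal sketch of a decoder layer with its two attention sub-layers. It is written in PyTorch purely for illustration (it is not the course's code, and names such as DecoderLayer, d_model and n_heads are my own choices):

```python
# Minimal sketch of a Transformer decoder layer (illustration only, not the course's code)
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # 1st MultiHeadAttention: masked decoder self-attention
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # 2nd MultiHeadAttention: encoder-decoder attention (K, V come from the encoder)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask):
        # Q, K, V all come from the decoder input; the causal mask blocks future positions
        a, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + a)
        # Q comes from the decoder, while K and V are the encoder output:
        # this is exactly where the encoder's K and V enter the decoder
        a, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + a)
        return self.norm3(x + self.ff(x))
```

The second sub-layer (cross_attn) is the part I could not find in the course's decoder diagram.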
—Second:
In the video called "Causal Attention":
https://www.coursera.org/learn/attention-models-in-nlp/lecture/AMz8y/causal-attention
Here is the image from the course:
It is said that in causal attention, queries and keys come from the same sentence. But I do not see the relation with the Transformer: I think it has not been put in its context within the Transformer architecture. (I am sorry, but I say this because I do not see its relation with the videos before and after it in the same module.)
I think that to put it in its context, we should talk about the first MultiHeadAttention of the decoder block (the masked multi-head attention): this is where we want each prediction to be unable to attend to the future.
In fact, causal attention always intervenes in the Transformer (with the encoder-decoder architecture), whatever the case study, because it acts in the decoder block, at the masked multi-head attention, and never in the encoder block. A small sketch of the causal mask is shown below.
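As an illustration, here is a minimal sketch (again in PyTorch, not the course's code) of the causal (look-ahead) mask used in that masked multi-head attention, where position i can only attend to positions up to i:

```python
# Minimal sketch of a causal (look-ahead) mask (illustration only)
import torch

seq_len = 5
# True above the diagonal marks the future positions that must be blocked
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# Scores at masked positions are set to -inf before the softmax,
# so the attention weights for future tokens become zero.
```

A mask like this is what would be passed as attn_mask to the first MultiHeadAttention in the decoder sketch above.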
Thank you for enlightening me if I did not understand well (and if I am wrong).