Something is wrong in the Decoder Block (Week 2): contradiction with the paper "Attention Is All You Need"


I have completed the first 3 courses of the NLP Specialization, and I am now in Course 4. I have completed Module 1 (NMT with attention), and it is awesome: this is a really interesting Specialization. However, I detected contradictions in Module 2 of Course 4. I feel I can talk about this because I have already done the new Module 4, “Transformers”, added by Andrew Ng, Younes and Kian to the updated Deep Learning Specialization (Course 5), I have spent a lot of time reading the paper “Attention Is All You Need”, and I completed the final “Transformers” assignment of Course 5 with full marks. But maybe I am not right, or I am missing something…
Thank you for enlightening me if I have misunderstood something.

Below, I explain the contradictions that I detected in Module 2 of Course 4 of the NLP Specialization:


In Module 2 (Course 4: Natural Language Processing with Attention Models), in the video called “Transformer Decoder” (link below):

Here is what I want to talk about (something wrong in the Decoder Block):

Just by reading the paper “Attention Is All You Need” and comparing the image above (from Module 2 of the course) with the one from the paper below, we can see that something is wrong in the course's decoder block (compare the decoder blocks in the two images).

In fact, the course's explanation of the decoder block never mentions how the two outputs of the encoder (the Key K and the Value V) are fed into the 2nd MultiHeadAttention of the decoder. The decoder architecture shown in the course looks like the Encoder with a Linear layer and a Softmax layer added at the end: the inner block is that of the Encoder, not of the Decoder.
In fact, the Decoder block has two MultiHeadAttention layers, not only one as explained in the course video “Transformer Decoder”:

  1. The first MultiHeadAttention (Decoder self-attention)

  2. The second MultiHeadAttention (Encoder-Decoder attention, where the Key and Value come from the Encoder and the Query comes from the Decoder)
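
To make the two attention layers concrete, here is a minimal NumPy sketch of how I understand the data flow in the paper's decoder block. It is only an illustration under strong simplifying assumptions: a single head, no learned projection matrices, and no residual connections, layer normalization or feed-forward layer. The function names (`attention`, `decoder_block_attention`) and the toy shapes are my own, not from the course or the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; mask is True where attending is allowed.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def decoder_block_attention(dec_x, enc_out):
    # Hypothetical helper: only the two attention sublayers of a decoder block.
    t = dec_x.shape[0]
    causal = np.tril(np.ones((t, t), dtype=bool))
    # 1. Masked self-attention: Q, K and V all come from the decoder input.
    self_out, self_w = attention(dec_x, dec_x, dec_x, mask=causal)
    # 2. Encoder-decoder attention: Q from the decoder, K and V from the encoder.
    cross_out, cross_w = attention(self_out, enc_out, enc_out)
    return cross_out, self_w, cross_w

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(5, 8))  # 5 source positions, model size 8 (toy values)
dec_x = rng.normal(size=(3, 8))    # 3 target positions
out, self_w, cross_w = decoder_block_attention(dec_x, enc_out)
print(out.shape, self_w.shape, cross_w.shape)
```

The cross-attention weights have one row per target position and one column per source position, which is exactly the link to the encoder that I could not find in the course diagram.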


In the video called “Causal Attention”:

Here is the image from the course:

The video says that in causal attention, queries and keys come from the same sentence. But I don't see the relation with the Transformer: I think it has not been put in its context, the Transformer architecture. (I am sorry to say this, but I don't see how it relates to the videos before and after it in the same module.)

I think that, to put it in context, we should talk about the 1st MultiHeadAttention of the Decoder block (the Masked MultiHeadAttention): this is where we want each prediction to be unable to attend to future positions.

In fact, in the Transformer with the Encoder-Decoder architecture, causal attention always intervenes, whatever the use case, in the Decoder block at the Masked MultiHeadAttention, and never in the Encoder block.
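
As a small illustration of what “cannot attend to the future” means, here is a NumPy sketch of causal self-attention. It is a toy version under my own assumptions (single head, no learned projections, random inputs), and `causal_self_attention` is a hypothetical name. The check at the end changes the last token and confirms that the outputs at earlier positions do not move.

```python
import numpy as np

def causal_self_attention(x):
    # Self-attention where position i may only attend to positions <= i.
    t, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # Entries strictly above the diagonal are future positions: mask them out.
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Softmax over each row; masked entries get weight exactly 0.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 positions, model size 8 (toy values)
x_changed = x.copy()
x_changed[3] = rng.normal(size=8)  # replace only the last ("future") token

out = causal_self_attention(x)
out_changed = causal_self_attention(x_changed)
# Earlier positions are unaffected by the change to the last token.
print(np.allclose(out[:3], out_changed[:3]))
```

This should print `True`. Without the mask, every position would see the changed token, which is the behaviour we want in the Encoder but not in the first attention layer of the Decoder.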

Thank you for enlightening me if I have misunderstood something (or if I am wrong).

I am back. I have just accessed the new version of the course. If you are already enrolled in the course (in progress), resetting deadlines will not work: you have to contact the Coursera Help Center (via chat) and ask them to unenroll you, then enroll again. Just one small thing: you will lose all your progress and will have to start again from scratch (even the assignments). It's OK now: I see that a lot of videos have been added, including in the 2nd module (the subject of my post above) and in Module 3 (many, many videos and hands-on labs).

So my post above is based on the old version of the course, which has now been enriched. If anyone is reading my post above (which I couldn't delete on the Discourse platform), it was about the old version of the course. As for the course video “Transformer Decoder”, it is not about block 2 (the decoder) of the original Transformer architecture from “Attention Is All You Need”. It covers another architecture, used for summarization tasks and composed of a single block (only the name is misleading): it is the GPT-2 model.

Hey @Abboura, I believe I'm taking the updated course (I was unenrolled and then enrolled again by the Coursera Help Center), but my Week 2 looks like your original post. I only see one multi-head attention block instead of two, and in the assignment we only implemented the picture from your first post, not the one in the paper.

I thought I was in the new version of the course, but maybe not? Could you tell me how these parts were updated?


@mazatov: Yes, GPT-2 is a single block (it is a decoder). I already mentioned it in my comment above.
To make sure that you have the new version of the course, there is a revealing clue: at the beginning of each video of the module (not the 1st video of each week), it is Younes who presents the video, not Łukasz. In the old version, Łukasz introduces each video (for one or two seconds), followed by Younes' explanation of the learning material…

Thanks @Abboura, looks like I do have the new version then!

I think I'm going to retake Weeks 3/4 of Course 5 of the Deep Learning Specialization first. I didn't know they got an update with transformers. I can easily do the assignments of this course, but I don't feel like I fully grasp the topic after doing them.


Not at all, @mazatov. And yes, I strongly recommend that you retake Course 5 of the Deep Learning Specialization (updated this year): Week 4 has new learning material (transformers), and you can review Week 3 (LSTMs with attention) plus the assignment.
