**In this example l’Afrique is x^{<3>} and the attention being computed for this word is A^{<3>} (l’Afrique). In the image above, Andrew Ng indicates that the word with the biggest weight in the computation of A^{<3>} will be “visite”, not l’Afrique itself, because “visite” gives more context to l’Afrique. So when all the V values are summed, A^{<3>} will contain more of the “visite” embedding than the l’Afrique embedding. My question is this: theoretically, if vanishing gradients did not exist, would a residual block still be required for a Transformer network to work, because the network would need to maintain the information of the original word x^{<3>} and combine it with A^{<3>}, given that A^{<3>} has more information about x^{<2>} than about x^{<3>}?**
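To make my question concrete, here is a rough numpy sketch of how I understand A^{<3>} being computed as a weighted sum of the value vectors (the toy numbers and variable names are just mine for illustration, not from the lecture):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Toy setup: 3 words ("Jane", "visite", "l'Afrique"), embedding dimension 4.
np.random.seed(0)
d = 4
q3 = np.random.randn(d)          # query for x^<3> (l'Afrique)
K  = np.random.randn(3, d)       # keys   for x^<1>, x^<2>, x^<3>
V  = np.random.randn(3, d)       # values for x^<1>, x^<2>, x^<3>

# Attention weights: how much each word contributes to A^<3>.
alpha = softmax(q3 @ K.T / np.sqrt(d))

# A^<3> is the alpha-weighted sum of ALL the value vectors, so if
# alpha[1] (the weight on "visite") is the largest, A^<3> is dominated
# by the value vector of "visite" rather than that of "l'Afrique".
A3 = alpha @ V
print(alpha, A3)
```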
@paulinpaloalto, do you by any chance have any insight on this?
Hi, Stephano.
Sorry, but I have just recently completed Week 3 of DLS C5 and have been intending to start Week 4 for the last few days. Of course there is no guarantee that I will be able to help even once I do get to that section, but I am tracking this thread. I’ll check back if no-one else answers by the time I get to that part.
But there are plenty of other mentors who know all this material, so you have a good chance that someone else will answer soon.
Regards,
Paul
Thank you as always, Paul.
Ok, that’s the second lecture and I just got done listening to it. In this context, he has not yet said anything about vanishing gradients as an issue with Transformers, so I think you’re getting ahead of things here. And my take on what he said is not that A^{<3>} doesn’t contain much information from x^{<3>}, but that the word “visite” provides the most “context” for how the word “l’Afrique” is used in that sentence: as a place that is being visited. So that allows us to select the best way to interpret the word “l’Afrique” in this sentence.
So my suggestion is to “hold that thought” about vanishing gradients and not try to implement ResNets here yet until we see how the multi-headed attention plays out in the next lecture.
Ok, well, I don’t claim to understand the last lecture with the full Transformer Architecture yet, but he does mention Residual Net architecture in the lecture. But it doesn’t sound so much like the problem is vanishing gradients, but that he wants to propagate the positional encoding information through the network.
But as I said, having just watched the lecture for the first time, I am 100% sure I do not really understand much of it yet. My current plan is to take a look at the quiz and the programming assignment to get a bit more context and then my guess is I will need to rewatch the lectures to get through that.
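For what it’s worth, my current understanding of the positional encoding he mentions is the sinusoidal one that simply gets added to the word embeddings before the attention layers, so the skip connections can carry that positional information forward. Something like this sketch (not the assignment’s actual code):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Sinusoidal positional encoding:
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# The encoding is added directly to the embeddings; the residual (skip)
# connections later in the network then carry this positional information
# forward alongside the attention outputs.
embeddings = np.random.randn(10, 16)            # 10 tokens, d_model = 16
x = embeddings + positional_encoding(10, 16)
```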
I don’t think “vanishing gradients” have anything to do with this issue.
Have you done the programming exercise yet? Where are the Residual Blocks actually used in the architecture? They are only mentioned briefly in the lectures and you need to look at the actual construction of the model to learn more. I’m not claiming to understand it all yet. I’ve started the programming assignment and scanned through to see where they mention the residual layers, but have not actually gotten to that part yet.
So @TMosh, is it because we want the network to take in information about the context of the input word (A^{<3>}) as well as the embedding of the input word itself (x^{<3>})? Is that why the residual block is used?
If you search the programming exercise for “residual”, you can see where Andrew mentions the residual blocks. He says in the programming exercise that the residual layer is for “normalization to help speed up training”, so I guess the network would still work without it. I’m just confused about how the network would output the correct translation without a residual block (since Andrew said it is only for speed) if the model doesn’t have the complete x embedding. The attention output could be an encoding that contains mostly the previous word, like “visite”, so when the model goes to produce a translation, how will it know to output “Africa” instead of “visite”, given that the attention output contains most of the “visite” embedding?
The “skip layer” and the “normalization layer” serve different functions. You don’t need to refer to them as a single entity.
The skip layer helps keep the solution moving, for the same reasons discussed in the ResNet exercise.
Normalization always helps iterative gradient methods work more efficiently.
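To illustrate that these are two separate pieces, here is roughly what the “Add & Norm” step of an encoder layer looks like in Keras (a sketch in the spirit of the assignment, not its exact code). The skip connection is the `x + attn_out`; the normalization is the separate `LayerNormalization` layer:

```python
import tensorflow as tf

d_model, num_heads = 16, 2
mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
layernorm = tf.keras.layers.LayerNormalization(epsilon=1e-6)

x = tf.random.normal((1, 3, d_model))   # batch of 1 sentence, 3 tokens

# Self-attention: each position's output is a weighted sum of value vectors.
attn_out = mha(query=x, value=x, key=x)

# Skip (residual) connection: add the ORIGINAL embeddings back in, so the
# output still carries x^<3> even if the attention weights mostly picked up
# "visite". LayerNormalization is a separate step that rescales activations
# to help training converge faster.
out = layernorm(x + attn_out)
```

Because the original x is added back before normalization, the layer’s output keeps the l’Afrique embedding alongside whatever context the attention mixed in.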
Thanks!