Self-Attention in transformer network

**In this example, l’Afrique is x^3 and the attention is being computed for this word: A^3 (l’Afrique). In the image above, Andrew Ng indicates that the word with the biggest weight in the computation of A^3 will be visite, not l’Afrique itself, because visite gives more context to l’Afrique. So when all the V values are summed, A^3 will contain more of the visite embedding than the l’Afrique embedding. So my question is this: theoretically, if vanishing gradients did not exist, would a residual block still be required for a transformer network to work? It seems the network would need to maintain the information of the original word x^3 and combine that with A^3, since A^3 has more information about x^2 than x^3.**
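To make the computation in the question concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention for one position. It assumes the sentence is "Jane visite l'Afrique en septembre"; the embeddings and the matrices `W_Q`, `W_K`, `W_V` are random stand-ins (not trained values), and the sizes `d_model` and `d_k` are arbitrary choices for illustration. It only shows that A^3 is a weighted sum of all the V vectors, with weights that sum to 1; which word gets the largest weight depends on the trained parameters.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# Assumed example sentence; x^3 is "l'Afrique" (index 2 when counting from 0).
words = ["Jane", "visite", "l'Afrique", "en", "septembre"]
d_model, d_k = 8, 4  # illustrative sizes, not taken from the course

# Random stand-ins for the word embeddings and the learned projections.
X = rng.normal(size=(len(words), d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

t = 2                                # position of "l'Afrique"
scores = Q[t] @ K.T / np.sqrt(d_k)   # one score per word in the sentence
weights = softmax(scores)            # attention weights, sum to 1
A3 = weights @ V                     # weighted sum of all value vectors

for word, w in zip(words, weights):
    print(f"{word:>10s}: {w:.3f}")
```

Note that the weight on l’Afrique itself is generally not zero, so A^3 still contains part of that word's own value vector even when another word has the largest weight.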

@paulinpaloalto do you by any chance have any insight on this ?

Hi, Stephano.

Sorry, but I have just recently completed Week 3 of DLS C5 and have been intending to start Week 4 for the last few days. Of course there is no guarantee that I will be able to help even once I do get to that section, but I am tracking this thread. I’ll check back if no-one else answers by the time I get to that part.

But there are plenty of other mentors who know all this material, so you have a good chance that someone else will answer soon.



Thank you as always, Paul.

Ok, that’s the second lecture and I just got done listening to it. In this context, he has not yet said anything about vanishing gradients as an issue with Transformers, so I think you’re getting ahead of things here. And my take on what he said is not that A^{<3>} doesn’t contain much information from x^{<3>}, but that the word “visite” provides the most “context” for how the word “l’Afrique” is used in that sentence: as a place that is being visited. So that allows us to select the best way to interpret the word “l’Afrique” in this sentence.

So my suggestion is to “hold that thought” about vanishing gradients and not try to implement ResNets here yet until we see how the multi-headed attention plays out in the next lecture.


Ok, well, I don’t claim to understand the last lecture with the full Transformer Architecture yet, but he does mention Residual Net architecture in the lecture. But it doesn’t sound so much like the problem is vanishing gradients, but that he wants to propagate the positional encoding information through the network.

But as I said, having just watched the lecture for the first time, I am 100% sure I do not really understand much of it yet. My current plan is to take a look at the quiz and the programming assignment to get a bit more context and then my guess is I will need to rewatch the lectures to get through that.


I don’t think “vanishing gradients” have anything to do with this issue.

Have you done the programming exercise yet? Where are the Residual Blocks actually used in the architecture? They are only mentioned briefly in the lectures and you need to look at the actual construction of the model to learn more. I’m not claiming to understand it all yet. I’ve started the programming assignment and scanned through to see where they mention the residual layers, but have not actually gotten to that part yet.


So @TMosh, is it because we want the network to take in information about the context of the input word (A^3) as well as the embedding of the input word (x^3)? Is that why the residual block is used?

If you search the programming exercise page for “residual”, you can see where Andrew mentions the residual blocks. He says in the programming exercise that the residual layer is for “normalization to help speed up training”, so I guess the network would still work without the residual layer. I’m just confused about how the network would output the correct translation without a residual block (since Andrew said it is only for speed) if the attention output doesn’t contain the complete x embedding. The attention could be an encoding that consists mostly of a previous word like visite, so when the model goes to output a translation, how will it know to output Africa instead of visite?

The “skip layer” and the “normalization layer” serve different functions. You don’t need to refer to them as a single entity.

The skip layer helps keep the solution moving, for the same reasons discussed in the ResNet exercise.

Normalization always helps iterative gradient methods work more efficiently.
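As a rough illustration of the two separate steps, here is a minimal NumPy sketch of a skip connection followed by layer normalization. The arrays `x` and `attn_out` are random stand-ins, and the sketch assumes the attention output has already been projected back to the same shape as `x` so the two can be added. The `layer_norm` helper is a simplified version that omits the learned scale and shift parameters a real layer would have.

```python
import numpy as np

def layer_norm(h, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    # Simplification: no learned scale/shift parameters.
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)

# Stand-ins: 5 positions, 8 features each (illustrative sizes).
x = rng.normal(size=(5, 8))         # input to the attention block
attn_out = rng.normal(size=(5, 8))  # attention output, assumed same shape as x

skip = x + attn_out     # skip connection: the original input is added back
out = layer_norm(skip)  # normalization: a separate step applied to the sum
```

The skip step is what carries `x` forward alongside the attention output; the normalization step only rescales the result.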

Thanks!