Understanding Transformer Network


I was going through the week 4 lecture videos on the Transformer network. The first part of the combination, i.e. self-attention, is described as enriching the relationships between words through the query, key, value concept. What I want to understand is: doesn’t the attention model already do that? I mean, what value does it add here? In the given example, we already take into account the contribution of ‘visite’ to ‘l’Afrique’ with the simple attention model by incorporating alpha<t,t’>. What do q, k, v do differently that alpha doesn’t?

Hi @piyush23 ,

I’d like to share my understanding below, and welcome any different ideas.
As for “attention” per se, the concept is the same in both; what differs is the implementation and the application. The attention model in week 3 is additive attention, whereas the Transformer in week 4 uses scaled dot-product attention. The Attention Is All You Need paper also compares these two implementations.
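In fact, you can think of the week 3 attention model in terms of q, k, v as well. For instance, alpha<t, t’> is calculated from q (i.e., the decoder state S) and k (i.e., a<t’>), and the attention weights (i.e., alpha<t, t’>) are then multiplied with v (i.e., a<t’> again) and summed to give the context. What changes in week 4 is that q, k, and v are separate learned projections, and in self-attention all three come from the same sequence, so every word can attend to every other word in it.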
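To make the mapping above concrete, here is a minimal NumPy sketch of both mechanisms side by side. It is only an illustration under my own assumptions, not the course's code: the function names, the weight names (W_s, W_a, v_e), and the single tanh hidden layer used for the additive score are all hypothetical choices (the course uses a small dense network for the score; the exact layers may differ).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(s_prev, a, W_s, W_a, v_e):
    """Week-3 style attention for one decoder step (assumed one-layer tanh scorer).

    s_prev: (n_s,)   decoder state, plays the role of q
    a:      (Tx, n_a) encoder activations, play the role of both k and v
    """
    # e<t,t'> = v_e . tanh(W_s s_prev + W_a a<t'>), one score per input position
    e = np.tanh(s_prev @ W_s + a @ W_a) @ v_e   # (Tx,)
    alpha = softmax(e)                          # alpha<t,t'>, sums to 1 over t'
    context = alpha @ a                         # weighted sum of the "values"
    return context, alpha

def scaled_dot_product_attention(Q, K, V):
    """Week-4 style attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (T_q, T_k)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V, weights

# Tiny demo with random numbers (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
Tx, n_a, n_s, n_h, d_k = 5, 8, 6, 10, 4

a = rng.normal(size=(Tx, n_a))
s_prev = rng.normal(size=(n_s,))
context, alpha = additive_attention(
    s_prev, a,
    rng.normal(size=(n_s, n_h)), rng.normal(size=(n_a, n_h)), rng.normal(size=(n_h,)),
)

# Self-attention: q, k, v are three different projections of the SAME sequence.
X = rng.normal(size=(Tx, n_a))
W_Q, W_K, W_V = (rng.normal(size=(n_a, d_k)) for _ in range(3))
A, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

In the additive version there is a single query (the decoder state) and the encoder activations serve as both keys and values, so you get one context vector per decoder step. In the self-attention version every position in X issues its own query, so A has one enriched representation per word.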
Of course, as you know, apart from the attention mechanism, the NMT architectures in weeks 3 and 4 are also different.