Hi,
I was going through the lecture videos on the Transformer Network in week 4. The first part of the combination, i.e. self-attention, is about enriching each word's representation with its relationships to the other words through the query/key/value concept. I want to understand: doesn't the attention model already do that? In other words, what value does it add here? In the example, we already take the contribution of 'visite' to 'l'Afrique' into account with the simple attention model by incorporating alpha<t,t'>. What do q, k, v do differently that alpha doesn't?
Hi @piyush23 ,
I’d like to share my understanding below, and welcome any different ideas.
"Attention" per se is the same concept in both; what differs is the implementation and the application. The attention model in week 3 uses additive attention, whereas the Transformer in week 4 uses scaled dot-product attention. The Attention Is All You Need paper also compares these two implementations.
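To make that comparison concrete, here is a minimal NumPy sketch (not the course or paper code) of the two score functions side by side; the weight names and sizes (W1, W2, v_a, Wq, Wk, Wv, d) are just illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 8                      # hidden size (assumed for illustration)
s = np.random.randn(d)     # decoder state s<t-1>, the "query" side
a = np.random.randn(5, d)  # encoder activations a<1>..a<5>, the "key"/"value" side

# Week 3 style: additive attention
# e<t,t'> = v_a^T tanh(W1 s + W2 a<t'>)  -- a small network scores each pair
W1, W2 = np.random.randn(d, d), np.random.randn(d, d)
v_a = np.random.randn(d)
e_add = np.array([v_a @ np.tanh(W1 @ s + W2 @ a_t) for a_t in a])
alpha_add = softmax(e_add)    # alpha<t,t'>
context_add = alpha_add @ a   # sum over t' of alpha<t,t'> * a<t'>

# Week 4 style: scaled dot-product attention
# q, K, V come from learned projections; the score is q.k / sqrt(d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
q = Wq @ s
K, V = a @ Wk.T, a @ Wv.T
alpha_dot = softmax(q @ K.T / np.sqrt(d))
context_dot = alpha_dot @ V
```

Both end with the same "softmax, then weighted sum" step; only the way the compatibility score between the decoder state and each a<t'> is computed differs.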
In fact, you can think of the week-3 attention model in the same q, k, v terms: alpha<t, t'> is calculated from q (i.e., the decoder state s<t-1>) and k (i.e., a<t'>), and then the attention weights alpha<t, t'> are used to take a weighted sum of v (i.e., a<t'> again).
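As a hedged sketch of that analogy (again not the course code, with made-up sizes): if you set q = s and k = v = a<t'> with no extra projections, scaled dot-product attention reduces to exactly the week-3 pattern of score, softmax, weighted sum:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
s = np.random.randn(d)     # decoder state plays the role of the query q
a = np.random.randn(5, d)  # each a<t'> plays the role of both key k and value v

alpha = softmax(a @ s / np.sqrt(d))  # alpha<t,t'> from scoring q against each k
context = alpha @ a                  # context = sum over t' of alpha<t,t'> * v
```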
Of course, as you know, beyond the attention mechanism itself, the overall NMT architectures in weeks 3 and 4 are also different.