self-attention:
A(q, K, V) = \sum_i \frac{\exp(q \cdot k^{<i>})}{\sum_j \exp(q \cdot k^{<j>})} \, v^{<i>}
in Transformer network:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
Why is there no summation in the 2nd formula?
It is implied in the softmax function itself.
The softmax function only has a summation in the denominator; it does not include the outer summation, does it?
Hi @mc04xkf ,
After applying softmax, you get a weight distribution over the source words. You can think of each weight as how much the target word has to pay attention to that source word.
The outer summation then computes the weighted sum over all source words by multiplying each weight by the corresponding value vector v^{<i>}.
By the way, in the 2nd formula above, V is a matrix multiplied on the right of the softmax output, so that matrix product performs the same weighted sum for every query at once.
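Here is a minimal NumPy sketch (toy sizes and variable names are my own) showing that the matrix form softmax(QK^T / sqrt(d_k)) V gives the same result as the explicit per-query summation; the scaling by sqrt(d_k) is applied in both versions so they match exactly:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 6                      # toy sizes: 4 tokens, key dim 8, value dim 6
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

# Matrix form: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
attn_matrix = softmax(Q @ K.T / np.sqrt(d_k)) @ V

# Per-query form: A(q, K, V) = sum_i w_i * v_i with w = softmax over source words
attn_loop = np.zeros((n, d_v))
for t, q in enumerate(Q):
    weights = softmax(q @ K.T / np.sqrt(d_k))              # one weight per source word
    attn_loop[t] = sum(w * v for w, v in zip(weights, V))  # the "missing" outer summation

print(np.allclose(attn_matrix, attn_loop))  # True: multiplying by V is the weighted sum
```

So the outer summation has not disappeared in the matrix form; it is exactly what the product with V computes, one row per query.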