self-attention:
A(q, K, V) = \sum_i \frac{\exp(q \cdot k^{<i>})}{\sum_j \exp(q \cdot k^{<j>})} \, v^{<i>}
in Transformer network:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
Why is there no summation in the 2nd formula?
It is implied in the softmax function itself.
The softmax function only has a summation in the denominator; it does not include the outer summation, does it?
Hi @mc04xkf ,
After applying softmax, you get a weight distribution over the source words. You can think of each weight as how much the target word has to pay attention to that source word.
The outer summation then computes the weighted sum over all source words by multiplying each weight by the corresponding value vector v^{<i>}.
By the way, in the 2nd formula above, V is a matrix multiplied on the right of the softmax output, so that matrix product performs the same weighted sum for every query at once.
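Here is a minimal NumPy sketch (toy sizes and variable names are my own) showing that the matrix form softmax(QK^T / sqrt(d_k)) V gives the same result as the explicit per-query summation; the scaling by sqrt(d_k) is applied in both versions so they match exactly:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 6                      # toy sizes: 4 tokens, key dim 8, value dim 6
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

# Matrix form: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
attn_matrix = softmax(Q @ K.T / np.sqrt(d_k)) @ V

# Per-query form: A(q, K, V) = sum_i w_i * v_i with w = softmax over source words
attn_loop = np.zeros((n, d_v))
for t, q in enumerate(Q):
    weights = softmax(q @ K.T / np.sqrt(d_k))              # one weight per source word
    attn_loop[t] = sum(w * v for w, v in zip(weights, V))  # the "missing" outer summation

print(np.allclose(attn_matrix, attn_loop))  # True: multiplying by V is the weighted sum
```

So the outer summation has not disappeared in the matrix form; it is exactly what the product with V computes, one row per query.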