I’m looking at Scaled Dot-Product Attention: Ungraded Lab
I think I understand the reason for all of the operations in Attention(Q,K,V) except for the division by the sqrt(dimension of embeddings in K).
The sqrt(dimension of embeddings in K) division seems to reduce the magnitude of the differences going into the softmax, so I do understand that the division affects the resulting weighting. But why division by this specific quantity? It would make some intuitive sense (though I still wouldn't understand why it was necessary) if we were dividing by the dimension of the embeddings in K without the sqrt, because each value in the QK^T matrix is a sum of dim_k products. But the sqrt makes me think I'm going in the wrong direction with this line of thinking.
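For reference, here's the operation as I understand it, written as a rough NumPy sketch (my own illustration, not the lab's actual code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Sketch of Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]                 # dimension of the key embeddings
    scores = Q @ K.T / np.sqrt(d_k)   # the division I'm asking about
    # softmax over the keys (rows sum to 1)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy shapes: 2 queries, 3 keys/values, embedding dimension 4
Q = np.random.randn(2, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 4)
```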
Thanks in advance for your patience.
Hi @Matt_Stults.
Regarding the sqrt(dim_k), you can refer to the paper "Attention Is All You Need", where the authors compare additive attention with dot-product attention. Basically, additive attention outperforms dot-product attention when no scaling factor is used. They also mention that for large values of dim_k the dot products can grow large in magnitude, pushing the softmax into regions where its gradients become extremely small. So the scaling factor sqrt(dim_k) can be viewed as a way to avoid the vanishing gradient problem.
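To make that concrete, here is a quick NumPy illustration (a toy example of my own, not from the lab) of how unscaled dot products push the softmax toward a one-hot output, which is exactly where its gradients are close to zero:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal(d_k)
K = rng.standard_normal((5, d_k))   # 5 keys

scores = K @ q                      # unscaled dot products, std ~ sqrt(d_k) ~ 22
scaled = scores / np.sqrt(d_k)      # scaled dot products, std ~ 1

print(softmax(scores))  # nearly one-hot: one weight ~1, the rest ~0
print(softmax(scaled))  # much smoother distribution

# When the output is nearly one-hot, the softmax Jacobian entries
# p_i * (delta_ij - p_j) are all close to 0, so gradients barely flow.
```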
Hope it helps.
Thanks! I had been thinking of this like a typical mathematical formula where each piece has a deep and specific meaning about the relationship of the parts. For example, I was looking for some deeper meaning to the fact that it’s a square root vs a cube root. My interpretation from your answer is that this is more of a pragmatic choice: sqrt(dim_k) just works better in practice than the alternatives tried. Does this sound correct or am I likely still just missing the meaning of this term?
Hi @Matt_Stults. You are right that it's largely a pragmatic choice: the point is to keep the argument of the softmax from growing so large that the softmax saturates at values where the gradient may vanish. That said, the square root itself is not arbitrary. The paper notes in a footnote that if the components of a query q and a key k are independent with mean 0 and variance 1, then their dot product has mean 0 and variance dim_k, i.e. a standard deviation of sqrt(dim_k). Dividing by sqrt(dim_k) brings the dot products back to roughly unit standard deviation, regardless of the embedding size.
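A quick numerical check of that variance argument (again just a toy NumPy example, not course code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n = 100_000

# Components of q and k drawn independently with mean 0, variance 1
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))
dots = (q * k).sum(axis=1)          # n independent dot products

print(dots.std())                   # ~ sqrt(d_k) = 8
print((dots / np.sqrt(d_k)).std())  # ~ 1 after the scaling
```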