In scaled dot-product attention, the dot product of Q and K is divided by the square root of d.
- What is the rationale for choosing d as the scaling factor?
- Why is its square root used for the scaling?
EDIT:
I found an explanation of why the scaling is done using d here, but not of why the square root is used.
Hi @Ritu_Pande
That is a good question
Each entry of Q \cdot K^T is a sum of d products of random numbers, and the variance of that sum grows with d, so the dot product has a much larger variance than the entries of Q and K themselves.
One of the best explanations I have seen is by Andrej Karpathy here, in the span of 3 minutes. (Note: wei there is equal to Q \cdot K^T.)
I would encourage you to try it and see for yourself, like:
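A minimal sketch of such an experiment. This is not the code from the video: it uses only the Python standard library, and the function name `dot_product_variances`, the sample count, and the seed are made up for illustration. It assumes the entries of q and k are independent draws with mean 0 and variance 1, and compares the variance of the raw dot product with the variance after dividing by the square root of d_k.

```python
import math
import random
import statistics


def dot_product_variances(d_k, n_samples=5000, seed=0):
    """Return (variance of q.k, variance of q.k / sqrt(d_k)) over random samples.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    raw = []
    for _ in range(n_samples):
        # q and k each have d_k independent entries with mean 0 and variance 1
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(qi * ki for qi, ki in zip(q, k)))
    # Scale every dot product by 1 / sqrt(d_k), as in scaled dot-product attention
    scaled = [x / math.sqrt(d_k) for x in raw]
    return statistics.pvariance(raw), statistics.pvariance(scaled)


for d_k in (4, 16, 64):
    raw_var, scaled_var = dot_product_variances(d_k)
    print(f"d_k={d_k:3d}  var(q.k)={raw_var:7.2f}  var(q.k/sqrt(d_k))={scaled_var:5.2f}")
```

Under these assumptions the first variance should come out close to d_k and the second close to 1, whatever d_k is.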
Cheers
Thank you for the reference and the details. After going through the video you shared, I also found a theoretical proof (here) of how the dot product of two random vectors of length d_k, whose entries are independent with mean 0 and variance 1, has variance d_k.
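Therefore, we have to divide by the standard deviation, \sqrt{d_k}, to scale the variance of the distribution back down to 1. That is why the square root is used rather than d_k itself.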
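For anyone who wants the steps written out, here is a short sketch of that argument. It assumes the entries q_i and k_i are all mutually independent with mean 0 and variance 1; the notation is mine, not taken from the linked proof.

```latex
\begin{align*}
q \cdot k &= \sum_{i=1}^{d_k} q_i k_i \\
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0 \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - \mathbb{E}[q_i k_i]^2 = 1 \cdot 1 - 0 = 1 \\
\operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k \\
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) &= \frac{d_k}{(\sqrt{d_k})^2} = 1,
\qquad
\operatorname{Var}\!\left(\frac{q \cdot k}{d_k}\right) = \frac{d_k}{d_k^2} = \frac{1}{d_k}
\end{align*}
```

The last line shows the difference between the two choices: dividing by \sqrt{d_k} gives variance 1, while dividing by d_k would shrink the variance to 1/d_k.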
Thanks for all the help you have been giving me in understanding these concepts.