According to the attention definition, the dot product of Q and K^{T} should be scaled down by the square root of d_k.
However, after getting d_k from k.ndim, I found that only dividing by d_k itself, instead of its square root, passes the unit test. Could this be attributed to my implementation details, or to the unit test itself?
hi xharles,
I think .ndim is not the right property for this exercise, as it gives you the dimension of the np array in terms of “how many axes does this tensor have”. Here that value is 2, i.e. it’s a 2-dimensional array. What you need instead is “how long is that axis”, so the .shape property might be helpful for you.
I hope this hint helps you solve it.
P.S. The square root of the correct d_k is, by coincidence, equal to .ndim, but that only holds for the unit test example.
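To make the difference between .ndim and .shape concrete, here is a minimal sketch. The arrays `q` and `k` and their sizes are made up for illustration (they are not the assignment's test data), and it assumes the key depth sits on the last axis of `k`:

```python
import numpy as np

# Hypothetical inputs: 3 queries and 5 keys, each of depth 4 (assumed sizes).
q = np.ones((3, 4))
k = np.ones((5, 4))

num_axes = k.ndim    # number of axes of the array, not the key depth
d_k = k.shape[-1]    # length of the last axis, assumed here to be the key depth

# Scale the dot product of Q and K^T by the square root of d_k.
scaled_scores = (q @ k.T) / np.sqrt(d_k)

# With a depth of 4, sqrt(d_k) happens to equal k.ndim; a different depth breaks the coincidence.
k_deeper = np.ones((5, 16))
coincidence_holds = np.sqrt(k_deeper.shape[-1]) == k_deeper.ndim
```

In this sketch, dividing by `k.ndim` and dividing by `np.sqrt(d_k)` give the same result only because the depth is 4. For `k_deeper` the two values differ, which is why a shortcut through .ndim would fail on other inputs.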