Scaled_dot_product_attention

According to the attention definition:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

the dot product of Q and K^{T} should be scaled down by the square root of d_k.
However, after getting d_k from k.ndim, I found that only dividing by d_k itself, rather than by its square root, passes the unit test. Could this be attributed to my implementation details, or to the unit test itself?

Hi xharles,

I think .ndim is not the right property for this exercise, as it gives you the dimension of the np array in the sense of “how many axes does this tensor have”. Here that value is 2, because it’s a 2-dimensional array. What you need instead is “how long is that axis”, so the .shape property might be helpful for you.
I hope this hint helps you solve it 🙂
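
To make the difference concrete, here is a small sketch. The array `k` and its shape (3, 4) are made up for illustration and are not taken from the assignment:

```python
import numpy as np

# Hypothetical key matrix: 3 keys, each of depth d_k = 4
# (shape chosen for illustration only)
k = np.zeros((3, 4))

print(k.ndim)       # 2       -> number of axes
print(k.shape)      # (3, 4)  -> length of each axis
print(k.shape[-1])  # 4       -> length of the last axis, i.e. d_k
```

So `k.ndim` stays 2 for any 2-dimensional array, whatever its depth is, while `k.shape[-1]` follows the actual d_k.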

P.S. The square root of the correct d_k happens to equal .ndim, but that is only a coincidence of the unit test example.
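
A rough sketch of why this happens, assuming 2-D inputs and no mask. The function signature and the random test shapes are my own assumptions, not the assignment's:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Minimal unmasked attention for 2-D arrays (illustrative only)."""
    d_k = k.shape[-1]                      # depth of the keys, NOT k.ndim
    scores = q @ k.T / np.sqrt(d_k)        # scale by sqrt(d_k)
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))   # 2 queries, depth 4
k = rng.normal(size=(3, 4))   # 3 keys, depth 4
v = rng.normal(size=(3, 5))   # 3 values, depth 5
out = scaled_dot_product_attention(q, k, v)

# With d_k = 4, sqrt(d_k) equals k.ndim, so dividing by k.ndim
# gives the same result by accident.
coincides = bool(np.sqrt(k.shape[-1]) == k.ndim)

# With d_k = 9 the two values differ and the shortcut breaks.
k9 = rng.normal(size=(3, 9))
still_coincides = bool(np.sqrt(k9.shape[-1]) == k9.ndim)
```

When d_k is 4, dividing by `k.ndim` and dividing by `np.sqrt(k.shape[-1])` both divide by 2, which would explain a passing test. For any other depth the two results diverge.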
