Why do we need to scale the embedding by multiplying it by the square root of the embedding dimension?
The same question applies to the decoder.
Can anyone explain? Thanks.
It’s not the same thing. In scaled dot-product attention, the scores are divided by sqrt(d), not multiplied.
This multiplication, on the other hand, is applied to the word embeddings before they are passed to the multi-head attention layers. In other words, it takes place before the scaled dot-product step.
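To make the distinction concrete, here is a minimal NumPy sketch (my own illustration, not code from the course). The sizes `d_model` and `seq_len` are made-up values, and the positional encoding and the learned Q/K/V projections are left out for brevity, so the scaled embeddings are used directly as queries, keys, and values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64   # embedding dimension (illustrative value)
seq_len = 5    # number of tokens (illustrative value)

# Step 1: embedding scaling happens BEFORE attention.
# The embeddings are MULTIPLIED by sqrt(d_model).
embeddings = rng.normal(size=(seq_len, d_model))
x = embeddings * np.sqrt(d_model)

# Step 2: inside scaled dot-product attention,
# the scores are DIVIDED by sqrt(d_k).
def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v, weights

# Simplification: no learned projections, so q = k = v = x.
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape, weights.shape)
```

The two sqrt(d) factors act on different quantities at different stages, so they do not cancel each other out.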
I think I found the answer after going through the ungraded lab "Transformer Pre-processing". The purpose of multiplying the word embeddings by sqrt(d) is to adjust the relative impact of the positional encoding, so that the positional encoding vectors do not dominate the embeddings. See details here:
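As a rough check of that explanation (this is my own sketch, not the lab's code), the snippet below compares vector norms with and without the sqrt(d) scaling. It uses the standard sinusoidal positional encoding. The initialization of the embeddings as small uniform values in [-0.05, 0.05] is an assumption for illustration, as are `d_model = 512` and `seq_len = 50`.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even indices, cos on odd indices."""
    pos = np.arange(seq_len)[:, None]   # shape (seq_len, 1)
    i = np.arange(d_model)[None, :]     # shape (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

d_model = 512   # illustrative value
seq_len = 50    # illustrative value
rng = np.random.default_rng(0)

# ASSUMPTION: embedding weights are small uniform values in [-0.05, 0.05].
embeddings = rng.uniform(-0.05, 0.05, size=(seq_len, d_model))
pe = positional_encoding(seq_len, d_model)

# Average L2 norm per token vector.
emb_norm = np.linalg.norm(embeddings, axis=1).mean()
scaled_norm = np.linalg.norm(embeddings * np.sqrt(d_model), axis=1).mean()
pe_norm = np.linalg.norm(pe, axis=1).mean()

print(f"embedding norm (unscaled): {emb_norm:.3f}")
print(f"embedding norm (scaled):   {scaled_norm:.3f}")
print(f"positional encoding norm:  {pe_norm:.3f}")
```

Each sin/cos pair contributes exactly 1 to the squared norm, so every positional encoding vector has norm sqrt(d_model / 2), which is 16 here. Under the assumed initialization the unscaled embedding norm is well below 1, so the positional encoding would swamp it when the two are added. After multiplying by sqrt(d_model), the two are of a similar order of magnitude.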

