[Week 4] Exercise 5 - Encoder. Why do we need to scale the embedding by sqrt(d)?

Why do we need to scale the embedding by multiplying it by the square root of the embedding dimension?
The same question applies in the decoder.
Can anyone explain? Thanks.

This is explained in the Self Attention lecture, see the time mark in this image:

It’s not the same thing. In scaled dot-product attention, the scores are divided by sqrt(d), not multiplied.
On the other hand, this multiplication is applied to the word embeddings before they are passed to the multi-head attention layers; in other words, it takes place before the scaled dot-product computation.
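
To make the two operations concrete, here is a minimal Python/TensorFlow sketch (the vocabulary size, shapes, and dimensions below are illustrative assumptions, not the assignment code). The embedding output is multiplied by sqrt(d_model) in the encoder/decoder before the positional encoding and attention layers, while inside scaled dot-product attention the Q·Kᵀ scores are divided by sqrt(d_k):

```python
import tensorflow as tf

d_model = 512   # embedding dimension (illustrative)
d_k = 64        # per-head query/key dimension (illustrative)

# (1) Embedding scaling: done in the encoder/decoder, BEFORE attention.
#     Token embeddings are MULTIPLIED by sqrt(d_model), then the
#     positional encoding is added on top.
token_ids = tf.random.uniform((1, 10), maxval=1000, dtype=tf.int32)  # (batch, seq_len)
embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=d_model)
x = embedding(token_ids)                                  # (batch, seq_len, d_model)
x *= tf.math.sqrt(tf.cast(d_model, tf.float32))           # multiply by sqrt(d_model)

# (2) Scaled dot-product attention: done INSIDE multi-head attention.
#     The Q·Kᵀ scores are DIVIDED by sqrt(d_k) before the softmax.
q = tf.random.normal((1, 10, d_k))
k = tf.random.normal((1, 10, d_k))
scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(d_k, tf.float32))
weights = tf.nn.softmax(scores, axis=-1)                  # attention weights
```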

I think I found the answer after going through the ungraded lab "Transformer Pre-processing". The purpose of multiplying the word embeddings by sqrt(d) is to balance the impact of the positional encoding, so the positional encoding vectors don't dominate the embeddings. See details here:

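For anyone curious, here is a rough NumPy sketch of that point (the initializer range and the dimensions are assumptions for illustration, not the lab's exact code). Sinusoidal positional encoding entries lie in [-1, 1], so each position vector has norm around sqrt(d/2), while a freshly initialized embedding vector has a much smaller norm; multiplying by sqrt(d) brings the two onto a comparable scale:

```python
import numpy as np

d_model, seq_len = 512, 50

def positional_encoding(positions, d):
    """Sinusoidal positional encoding as in 'Attention Is All You Need'."""
    angles = np.arange(positions)[:, None] / np.power(
        10000, (2 * (np.arange(d)[None, :] // 2)) / np.float32(d))
    pe = np.zeros((positions, d))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even indices: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd indices: cosine
    return pe

pe = positional_encoding(seq_len, d_model)

# Keras' default Embedding initializer draws from roughly uniform(-0.05, 0.05),
# so an untrained embedding vector is tiny compared with the positional encoding.
emb = np.random.uniform(-0.05, 0.05, size=(seq_len, d_model))

print("mean ||PE||             :", np.linalg.norm(pe, axis=1).mean())                      # ~16
print("mean ||emb|| (unscaled) :", np.linalg.norm(emb, axis=1).mean())                     # ~0.65
print("mean ||emb * sqrt(d)||  :", np.linalg.norm(emb * np.sqrt(d_model), axis=1).mean())  # ~15
```

Without the scaling, the sum of embedding and positional encoding would be dominated by the position signal; with it, the two contributions have comparable magnitude.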