Why does the embedding need to be rescaled by the square root of the embedding dimension?

This concerns the implementation of the Encoder, where the word indices are first embedded into vectors by tf.keras.layers.Embedding.

So why is scaling by sqrt(emb_dim) needed? I don’t believe this is explained in the paper “Attention Is All You Need”, and I am not clear about its justification.
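For concreteness, here is a minimal sketch of the step I mean (the names `d_model`, `vocab_size`, and `embed_and_scale` are illustrative, not the exact assignment variables):

```python
import tensorflow as tf

d_model = 128        # embedding dimension (illustrative value)
vocab_size = 8000    # illustrative value

embedding = tf.keras.layers.Embedding(vocab_size, d_model)

def embed_and_scale(token_ids):
    # Embed the word indices, then rescale by sqrt(d_model) --
    # this multiplication is what the question is about.
    x = embedding(token_ids)                         # (batch, seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(d_model, tf.float32))
    return x

x = embed_and_scale(tf.constant([[3, 17, 42, 5]]))
print(x.shape)  # (1, 4, 128)
```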

Does this depend on the particular default initialization strategy of Keras’s Embedding and/or the W matrices (the “projections”) inside MultiHeadAttention, such that the vector norm could otherwise make things hard to train? I have a feeling this may be very dependent on what the framework class does by default.
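One quick way to probe that hypothesis is to look at the scale of the embedding values before and after the multiplication. This is just a rough check, assuming Keras’s default `uniform` embeddings initializer (values roughly in [-0.05, 0.05]):

```python
import numpy as np
import tensorflow as tf

d_model = 128
emb = tf.keras.layers.Embedding(1000, d_model)   # default embeddings_initializer='uniform'
vecs = emb(tf.range(1000))                       # shape (1000, d_model)

raw_std = float(tf.math.reduce_std(vecs))
print(f"std of raw embedding values:   {raw_std:.3f}")                     # roughly 0.03
print(f"std after multiplying sqrt(d): {raw_std * np.sqrt(d_model):.3f}")  # roughly 0.3
```

Since the sinusoidal positional encoding added right after this step takes values in [-1, 1], an unscaled embedding of that magnitude would be largely drowned out by it, which is one commonly cited justification for the multiplication.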

I have the same question…

I think I found the answer; please see my post: [Week 4]Exercise 5 - Encoder. Why need to scale the embedding by sqrt(d)? - #3 by Shengwu