This concerns the Encoder implementation, where word indices are first embedded into vectors by tf.keras.layers.Embedding.
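
For context, the line I'm asking about looks roughly like this (a minimal sketch in my own words, not the exact tutorial code; the `d_model` naming and the surrounding class structure are mine):

```python
import tensorflow as tf

class Encoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.d_model = d_model
        # default embeddings_initializer is 'uniform' (small random values)
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)

    def call(self, x):
        x = self.embedding(x)                                  # (batch, seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))   # <-- the scaling in question
        # ... positional encoding and attention layers follow
        return x
```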

So why is scaling the embedding output by sqrt(emb_dim) needed? I don't believe the reasoning for it is explained in the paper "Attention Is All You Need", and I am not clear on its justification.

Does this depend on the particular default initialization strategy of Keras's Embedding and/or the W matrices (the "projections") inside MultiHeadAttention, such that the embedding vector norms would otherwise make training harder? I have a feeling this may be heavily implementation-dependent, i.e. on what the framework classes do by default.
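
To make that concern concrete, this is the kind of check I have in mind (purely illustrative; as far as I can tell the default embeddings_initializer='uniform' draws values roughly from U(-0.05, 0.05), so the unscaled vectors have fairly small norms):

```python
import tensorflow as tf

vocab_size, d_model = 1000, 512
emb = tf.keras.layers.Embedding(vocab_size, d_model)  # default 'uniform' init

tokens = tf.constant([[1, 2, 3, 4]])
vecs = emb(tokens)  # (1, 4, d_model)

# Compare per-token norms before and after the sqrt(d_model) scaling:
# the scaled vectors are sqrt(d_model) (~22.6 here) times larger in norm.
print(tf.norm(vecs, axis=-1))
print(tf.norm(vecs * tf.sqrt(tf.cast(d_model, tf.float32)), axis=-1))
```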