I found this interesting question unanswered. I think Varun has already finished the assignment, but my answer is for all learners who arrive here through a search or other routes.

There are two key points here about the scale factors.

For the first question, about \sqrt{d_{model}}: the paper does not give a reason for this factor.

In the embedding layers, we multiply the embedding weights by \sqrt{d_{model}}.

There have been several discussions about how this should be interpreted. The most plausible explanation, in my view, is that it balances the “word embedding” against the “position encoding”. From several trials, it appears the authors considered the word embedding relatively more important than the position encoding and wanted to weight it accordingly. Of course, the position encoding is the only source of position information for the words, so it cannot be dropped. In practice, multiplying the word embedding by \sqrt{d_{model}} seems to have worked well. Note that both have the same dimension, so this is purely a matter of relative weighting.
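To make this balance concrete, here is a minimal NumPy sketch. The vocabulary size, sequence length, and small-variance embedding initialization are my own assumptions for illustration, not details from the paper; only the \sqrt{d_{model}} multiplier and the sinusoidal position encoding come from it.

```python
import numpy as np

d_model = 512      # model dimension from the paper
vocab_size = 1000  # hypothetical vocabulary size
seq_len = 10       # hypothetical sequence length

rng = np.random.default_rng(0)

# Word embeddings initialized with small values (~N(0, 1/d_model)).
embedding = rng.normal(0, d_model ** -0.5, (vocab_size, d_model))

# Sinusoidal position encoding; its entries lie in [-1, 1].
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model)[None, :]
angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

tokens = rng.integers(0, vocab_size, seq_len)
emb = embedding[tokens]

# Without scaling, the embedding magnitudes are much smaller than the
# position encoding's, so the positions would dominate the sum.
print(np.abs(emb).mean(), np.abs(pe).mean())

# Multiplying by sqrt(d_model) brings the embeddings up to a comparable
# scale before the two are added.
x = emb * np.sqrt(d_model) + pe
```

With this initialization, the unscaled embedding entries are roughly an order of magnitude smaller than the position encoding entries, and the \sqrt{d_{model}} factor closes that gap.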

The second question is not about this factor but about scaled dot-product attention, so it is a separate discussion entirely. As its name suggests, scaled dot-product attention is a “scaled” version of dot-product attention, and for this one the paper states the reason clearly:

We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by \frac{1}{\sqrt{d_k}}.

Unlike the first one, this scale factor is \frac{1}{\sqrt{d_k}}, where d_k is the key dimension.
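A minimal NumPy sketch of this (single head, no masking; the shapes and the unit-variance test inputs are my own assumptions) also shows why the factor matters: for queries and keys with unit-variance entries, a raw dot product over d_k dimensions has standard deviation about \sqrt{d_k}, and dividing by \sqrt{d_k} brings it back to about 1.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, as in the paper (single head, no mask)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)               # the 1/sqrt(d_k) scaling
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(5, d_k))
k = rng.normal(size=(5, d_k))
v = rng.normal(size=(5, d_k))

# Raw scores spread out like sqrt(d_k); scaled scores stay near unit scale,
# keeping the softmax away from its saturated, tiny-gradient regions.
raw = q @ k.T
print(raw.std())                    # roughly sqrt(64) = 8
print((raw / np.sqrt(d_k)).std())   # roughly 1

out = scaled_dot_product_attention(q, k, v)
```

Without the division, the softmax over scores of magnitude ~8 would put nearly all its weight on one key, which is exactly the small-gradient regime the quoted passage describes.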

I hope this clarifies the scaling factors for the word embedding and for scaled dot-product attention.