Scaling the Embedding Outcome in the Encoder

I have 2 questions about the snippet of code below from the encoder layer (class Encoder(tf.keras.layers.Layer)):

        # Pass input through the Embedding layer
        x = self.embedding(x)  # (batch_size, input_seq_len, embedding_dim)
        # Scale embedding by multiplying it by the square root of the embedding dimension
        x *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))
        # Add the position encoding to embedding
        x += self.pos_encoding[:, :seq_len, :]

  1. Why is the scaling done before adding the positional encodings to x?

  2. According to the original paper “Attention Is All You Need”, the requirement for scaling and its reason are given below.

But in the case of the embedding matrix, the input tensor to the embedding layer is sparse, so why do we need to scale it?

I found this interesting question unanswered. I think Varun has already finished the assignment, but my answer is for all learners who come here via a search or otherwise.

This answer covers two key points about scaling factors.

For the first question, about \sqrt{d_{model}}, the paper does not give any reason. It only says:

In the embedding layers, we multiply those weights by \sqrt{d_{model}}.

There have been several discussions about how this should be interpreted. I think the most plausible interpretation is that it balances the “word embedding” against the “positional encoding”. Judging from several trials, it seems the authors considered the word embedding relatively more important than the positional encoding and wanted to weight it accordingly. Of course, the positional encoding is the only source of position information for the words, so it cannot be removed. But multiplying the word embedding by \sqrt{d_{model}} appears to work well. Note that both have the same dimension, d_{model}, so it is simply a matter of relative weighting. This is also why the scaling has to happen before the positional encoding is added: scaling afterwards would scale the positional encoding as well and leave the relative weights of the two terms unchanged.
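A minimal sketch of this balance, assuming the standard sinusoidal positional encoding and the default Keras Embedding initializer (the exact numbers depend on how the embedding weights are initialized in your model):

    import numpy as np
    import tensorflow as tf

    def positional_encoding(max_len, d_model):
        # Standard sinusoidal positional encoding from "Attention Is All You Need"
        pos = np.arange(max_len)[:, np.newaxis]                    # (max_len, 1)
        i = np.arange(d_model)[np.newaxis, :]                      # (1, d_model)
        angles = pos / np.power(10000.0, (2 * (i // 2)) / np.float32(d_model))
        angles[:, 0::2] = np.sin(angles[:, 0::2])                  # even indices: sin
        angles[:, 1::2] = np.cos(angles[:, 1::2])                  # odd indices: cos
        return tf.cast(angles[np.newaxis, ...], tf.float32)        # (1, max_len, d_model)

    d_model, vocab_size, seq_len = 512, 8000, 10
    embedding = tf.keras.layers.Embedding(vocab_size, d_model)
    pos_encoding = positional_encoding(seq_len, d_model)

    tokens = tf.random.uniform((1, seq_len), maxval=vocab_size, dtype=tf.int32)
    x = embedding(tokens)                                          # (1, seq_len, d_model)

    rms = lambda t: float(tf.sqrt(tf.reduce_mean(tf.square(t))))
    print("word embedding RMS     :", rms(x))                      # ~0.03 (default uniform init)
    print("scaled embedding RMS   :", rms(x * tf.math.sqrt(tf.cast(d_model, tf.float32))))  # ~0.65
    print("positional encoding RMS:", rms(pos_encoding))           # ~0.71 (sines and cosines)

Without the \sqrt{d_{model}} factor, the raw embeddings (RMS about 0.03 here) would be almost drowned out by the positional encoding (RMS about 0.71) when the two are added; after scaling, the two terms are of comparable magnitude.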

For the second question: the quoted passage is not about the embedding scaling but about the scaled dot-product attention, so it is a totally different discussion. As its name shows, the scaled dot-product attention is a “scaled” version of the “dot-product attention”. For this part, the paper clearly states the reason:

We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by \frac{1}{\sqrt{d_k}}.

Unlike the first one, this scale factor is \frac{1}{\sqrt{d_k}}, i.e. the dot products are divided by \sqrt{d_k} rather than multiplied.
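As a rough illustration (a minimal sketch with arbitrary shapes, not the assignment's implementation): the dot product of two d_k-dimensional vectors with unit-variance components has a standard deviation of roughly \sqrt{d_k}, and dividing by \sqrt{d_k} brings the attention logits back to unit scale:

    import tensorflow as tf

    def scaled_dot_product_attention(q, k, v):
        # softmax(Q K^T / sqrt(d_k)) V
        d_k = tf.cast(tf.shape(k)[-1], tf.float32)
        scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(d_k)  # (..., seq_q, seq_k)
        weights = tf.nn.softmax(scores, axis=-1)
        return tf.matmul(weights, v), weights

    d_k = 64
    q = tf.random.normal((1, 5, d_k))
    k = tf.random.normal((1, 5, d_k))
    v = tf.random.normal((1, 5, d_k))

    raw = tf.matmul(q, k, transpose_b=True)
    print("std of raw logits   :", float(tf.math.reduce_std(raw)))               # ~ sqrt(64) = 8
    print("std of scaled logits:", float(tf.math.reduce_std(raw)) / d_k ** 0.5)  # ~ 1

    output, attention_weights = scaled_dot_product_attention(q, k, v)
    print(output.shape, attention_weights.shape)                                 # (1, 5, 64) (1, 5, 5)

With logits whose standard deviation is around 8, the softmax puts nearly all of its weight on one position, which is exactly the small-gradient regime the paper warns about; after dividing by \sqrt{d_k}, the logits stay around unit scale.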

I hope this clarifies the scaling factors for the “word embedding” and the “scaled dot-product attention”.