Why scale up embeddings by √d_model instead of scaling down positional encodings?

In “Attention Is All You Need,” the authors multiply the embedding weights by √d_model before adding positional encodings. The reasoning is clear — embeddings are initialized with small values (~0.01) while positional encodings (sin/cos) range from -1 to +1, so without scaling, positional encodings would dominate and drown out the token semantics.

But why scale UP the embeddings rather than scale DOWN the positional encodings by dividing by √d_model? Mathematically, the result should be the same up to a global constant: both approaches bring the two signals to the same relative scale.
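
Here's a quick numeric sketch of what I mean (PyTorch, with a uniform stand-in for the sinusoidal PE purely for illustration):

```python
import torch

d_model = 512
torch.manual_seed(0)

emb = torch.randn(10, d_model) * 0.01   # token embeddings with a small init
pe = torch.rand(10, d_model) * 2 - 1    # stand-in for sinusoidal PE values in [-1, 1]

scale_up = emb * d_model ** 0.5 + pe    # the paper's choice: scale embeddings up
scale_down = emb + pe / d_model ** 0.5  # the alternative: scale PE down

# The two sums differ only by a global factor of sqrt(d_model):
print(torch.allclose(scale_up, scale_down * d_model ** 0.5))  # True
```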

One might argue that since embeddings are learnable and positional encodings are fixed, it’s “cleaner” to modify the learnable part. But I don’t find this convincing — if anything, it seems more natural to leave the learnable parameters alone (let the model figure out its own scale during training) and instead scale the fixed component to match.

Is there a concrete reason for this choice? A historical convention from prior work? A subtle interaction with weight tying (since the embedding matrix is shared with the output projection)? Or is this genuinely just an arbitrary implementation decision that doesn’t meaningfully affect training?
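
For reference, the weight tying I have in mind looks roughly like this (a hypothetical PyTorch-style sketch, not the paper's actual code); scaling in the forward pass leaves the shared matrix itself untouched when it is reused as the output projection:

```python
import torch
import torch.nn as nn

class TiedEmbedding(nn.Module):
    """Hypothetical sketch of input/output weight tying with the sqrt(d_model) scaling."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)  # shared weight matrix

    def embed_tokens(self, token_ids: torch.Tensor) -> torch.Tensor:
        # The scaling is applied to the activations, not to the stored weights.
        return self.embed(token_ids) * self.d_model ** 0.5

    def project_to_vocab(self, hidden: torch.Tensor) -> torch.Tensor:
        # Same matrix reused as the output projection, with no sqrt(d_model) factor.
        return hidden @ self.embed.weight.t()
```

With this arrangement the √d_model factor only touches the input side of the shared matrix, which is part of what I'm wondering about.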

Hi @ulyaaliyeva206,

Looks like someone's digging into LLM technicalities after the RAG course :clap::clap::clap: Great, keep learning!

As far as I understand, the point of scaling the embeddings by √d_model is to ensure the semantic information dominates the combined representation, while the positional information (the positional encoding) acts as a relative signal.

Word embeddings are learned semantic vectors. Multiplying them by √d_model raises their magnitude so that the fixed positional encoding values don't dominate the sum (token embedding + positional encoding), which has to carry both semantic meaning and word order.

If the embeddings weren't scaled up, the PE values (which are fixed in the range [-1, 1]) could cause the combined representation to overemphasize position over meaning.
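
To make the magnitudes concrete, here is a small sketch (assuming an embedding init with std ≈ 1/√d_model; the exact init varies across implementations):

```python
import math
import torch

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sin/cos positional encoding from the paper."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    idx = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, idx / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

d_model = 512
torch.manual_seed(0)
emb = torch.randn(1, d_model) / math.sqrt(d_model)  # assumed init, std ≈ 1/sqrt(d_model)
pe = sinusoidal_pe(50, d_model)

print(f"embedding norm:           {emb.norm().item():.2f}")                         # ≈ 1
print(f"scaled embedding norm:    {(emb * math.sqrt(d_model)).norm().item():.2f}")  # ≈ 22.6
print(f"positional encoding norm: {pe[10].norm().item():.2f}")                      # ≈ 16, i.e. sqrt(d_model / 2)
```

Without the scaling, the positional encoding's norm is an order of magnitude larger than the embedding's; with it, the token signal is at least comparable.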

Another reason, thinking about it gradient-wise: scaling up the input embeddings helps keep the activation variance in a reasonable range in the early layers of training, which reduces the risk of gradients becoming too large or too small and destabilizing learning.
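
A tiny check of the variance point, under the same assumed init as above:

```python
import torch

d_model = 512
torch.manual_seed(0)
emb = torch.randn(1000, d_model) / d_model ** 0.5  # assumed init, per-dim variance ≈ 1/d_model

print(f"per-dim variance before scaling: {emb.var().item():.4f}")                     # ≈ 0.002
print(f"per-dim variance after scaling:  {(emb * d_model ** 0.5).var().item():.4f}")  # ≈ 1.0
```

The scaling brings the per-dimension variance back to roughly 1, which is the range downstream initializations typically expect.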

Why not scale down the positional encoding instead? Scaling the positional encoding down would also bring the magnitudes into balance, but it might make the absolute positional signal too faint for the model to pick up the relative order of words effectively.

At its core, the Transformer architecture keeps the sinusoidal positional encoding fixed for any input sequence, so the natural place to apply the scaling is the learnable part (the embeddings): the parameters learn the semantic meaning, while the fixed relative-position signal stays as designed.

For example: “The dog is riding the horse” vs. “The man is riding the horse.”
Here both sentences have nearly identical positional encodings, but “dog” and “man” occupy different regions of the embedding space, and it is that semantic difference the model needs to dominate the sum so it can tell the two sentences apart.

Regards

Dr. Deepti