[Week 4] Exercise 5 - Encoder. Why do we need to scale the embedding by sqrt(d)?

Shengwu · August 13, 2021, 2:35am

Why do we need to scale the embedding by multiplying it by the square root of the embedding dimension?
The same question applies to the decoder.
Can anyone explain? Thanks.

TMosh · August 13, 2021, 3:59am

This is explained in the Self Attention lecture; see the time mark in this image:

Shengwu · August 13, 2021, 7:22am

It’s not the same thing. In scaled dot-product attention, the scores are divided by sqrt(d), not multiplied.
Also, this multiplication is applied to the word embeddings before they are passed to the multi-head attention layers; in other words, it takes place before the scaled dot-product step.
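
To make the distinction concrete, here is a minimal NumPy sketch of the two scalings. It is not the assignment's code: the function names `encoder_input` and `scaled_dot_product_attention`, and the arguments `embedding_table` and `pos_encoding`, are hypothetical names chosen for illustration.

```python
import numpy as np

def encoder_input(token_ids, embedding_table, pos_encoding):
    """Hypothetical helper: builds the tensor fed to the first encoder layer."""
    d_model = embedding_table.shape[-1]
    # Scaling #1: MULTIPLY the embeddings by sqrt(d_model),
    # before the positional encoding is added.
    x = embedding_table[token_ids] * np.sqrt(d_model)
    return x + pos_encoding[: len(token_ids)]

def scaled_dot_product_attention(q, k, v):
    """Hypothetical helper: single-head attention without masking."""
    d_k = k.shape[-1]
    # Scaling #2: DIVIDE the attention logits by sqrt(d_k),
    # inside the attention layer, before the softmax.
    logits = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

The first scaling acts on the layer input, the second on the query-key products, so they happen at different places and for different reasons.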

I think I found the answer after going through the ungraded lab "Transformer Pre-processing". The purpose of multiplying the word embeddings by sqrt(d) is to adjust the relative impact of the positional encoding, so that the positional encoding vectors do not dominate the embeddings. See details here:
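
As a rough check of the magnitude argument, the sketch below compares vector norms. It rests on assumptions that are not from the lab: `d_model = 512`, sinusoidal positional encodings, and embeddings drawn uniformly from [-0.05, 0.05] as a stand-in for small, untrained embedding weights. The helper names are hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    positions = np.arange(num_positions)[:, None]
    dims = np.arange(d_model)[None, :]
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def mean_norm(x):
    """Average Euclidean norm of the row vectors in x."""
    return np.linalg.norm(x, axis=-1).mean()

d_model = 512            # assumed embedding dimension
num_positions = 50       # assumed sequence length
rng = np.random.default_rng(0)

# Assumption: small random values standing in for untrained embeddings.
embeddings = rng.uniform(-0.05, 0.05, size=(num_positions, d_model))
pos_encoding = sinusoidal_positional_encoding(num_positions, d_model)

raw_norm = mean_norm(embeddings)
scaled_norm = mean_norm(embeddings * np.sqrt(d_model))
# Each sin/cos pair contributes 1 to the squared norm, so this is sqrt(d_model / 2).
pe_norm = mean_norm(pos_encoding)

print(f"embedding norm (unscaled): {raw_norm:.2f}")
print(f"embedding norm (scaled):   {scaled_norm:.2f}")
print(f"positional encoding norm:  {pe_norm:.2f}")
```

Under these assumptions the unscaled embeddings are much smaller in norm than the positional encodings, and multiplying by sqrt(d_model) brings the two to a comparable scale. Trained embeddings may behave differently, so treat this only as an illustration of the idea.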
