Hi
I just finished the ungraded lab about positional encoding and I have a question regarding the weights that multiply the embedding vector and the positional encoding vector.
So, in the lab we have
embedding * W1 + pos_encoding[:,:,:] * W2
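For context, here is a minimal NumPy sketch of what I mean (the sinusoidal positional_encoding helper and the toy shapes below are my own simplification, not the exact lab code):

```python
import numpy as np

def positional_encoding(positions, d_model):
    # Standard sinusoidal encoding: sin on even indices, cos on odd indices
    pos = np.arange(positions)[:, np.newaxis]              # (positions, 1)
    i = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads = pos * angle_rates                          # (positions, d_model)
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return angle_rads[np.newaxis, ...]                      # (1, positions, d_model)

# Toy embeddings: batch of 1, sequence of 4 tokens, d_model = 8
embedding = np.random.randn(1, 4, 8)
pos_encoding = positional_encoding(4, 8)

# The lab's experiment: scale each component before adding them.
# For example, W2 >> W1 makes the positional part dominate the sum.
W1, W2 = 1.0, 10.0
combined = embedding * W1 + pos_encoding[:, :, :] * W2
print(combined.shape)  # (1, 4, 8)
```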
I would like to confirm something
So, we should not make the weight on the positional encoding too large, right? Otherwise the positional values would dominate the embedding vectors and they would lose their semantic quality?
I am also curious about W1 and W2 because I have recently been using a HuggingFace transformer, and I have not encountered any hyperparameters that let me adjust the weight of the positional encoding vector.
Does anyone have a reference for this?
I have not looked for any external references about positional encodings and have never worked with the HuggingFace code, but I think the W1 and W2 values used in this ungraded notebook are just an experiment to give you a sense of the relative contributions of the word embeddings versus the positional encodings. By overemphasizing one or the other, you get to see the effect each has on the combined values.
But now look at what happens when they actually apply positional encodings in the real graded Transformers Lab assignment this week: they just add the values with no weight factor in both the Encoder and the Decoder.
So my conclusion is that the experiments shown here in this ungraded lab have nothing to do with how positional encodings are actually used.
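As a rough sketch of the idea (this is my own simplified NumPy version reusing the positional_encoding helper from the sketch above, not the exact assignment code):

```python
import numpy as np

d_model, seq_len = 8, 4
embedding = np.random.randn(1, seq_len, d_model)      # toy stand-in for the Embedding layer output
pos_encoding = positional_encoding(seq_len, d_model)  # sinusoidal helper sketched earlier in this thread

# Scale the embeddings by sqrt(d_model), as in "Attention Is All You Need",
# then simply add the positional encoding for the current sequence length.
# There is no W1 or W2 anywhere.
x = embedding * np.sqrt(d_model)
x = x + pos_encoding[:, :seq_len, :]
print(x.shape)  # (1, 4, 8)
```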
Agree with Paul. W1 and W2 are there in this ungraded lab to show us the effect, but they don't exist as hyperparameters in real transformers.
Thanks to both of you for the explanation. This clears up a lot of things for me.