Positional Embeddings

When adding a positional component (pc) to our word embeddings, I understand the idea that the pc stays the same across all of our examples, so the model can learn which position comes where. But won’t adding the pc and the word embedding (wd) together result in a loss of the original information in both pc and wd? Adding a pc to wd makes the wd vector point in some other direction, so we can’t tell which word it was anymore, and vice versa.

Is the model somehow learning to ‘minus’ the pc from the combined embedding so that it can get back the original word embedding and still retain the positional information somehow?
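To make the concern concrete, here is a rough sketch of what I mean (a random stand-in for wd and the sinusoidal pc from the original paper; the numbers are only illustrative):

```python
import numpy as np

d_model = 512
rng = np.random.default_rng(0)

# A random stand-in for one word embedding (wd)
wd = rng.normal(size=d_model)

# Sinusoidal positional encoding (pc) for a single position, as in the original paper
def positional_encoding(pos, d_model):
    pc = np.zeros(d_model)
    i = np.arange(d_model // 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pc[0::2] = np.sin(angles)
    pc[1::2] = np.cos(angles)
    return pc

pc = positional_encoding(pos=3, d_model=d_model)
combined = wd + pc  # this is what actually enters the encoder

# The combined vector no longer points exactly where wd pointed
cos = wd @ combined / (np.linalg.norm(wd) * np.linalg.norm(combined))
print(f"cosine(wd, wd + pc) = {cos:.3f}")  # noticeably below 1.0
```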

PS: I couldn’t add the [week4] tag, hence the question is posted in General.

Based on experimentation, the original authors of the Transformer probably found that the model performed well with their approach of adding pc + wd.

One way to think about this is that the model learns about the dataset not just with respect to the word embedding but also based on the word’s position within the sentence.
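To make that concrete, here is a minimal sketch of the pc + wd combination, assuming the sinusoidal version of pc from the paper (the embedding table below is just a random stand-in, and the sqrt(d_model) scaling is the extra detail the paper applies to the embedding layer):

```python
import numpy as np

def sinusoidal_table(max_len, d_model):
    # Fixed table of positional encodings, identical for every training example
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000, 2 * i / d_model)
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

d_model, vocab_size = 512, 10000
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, d_model))  # stand-in for a learned embedding table

token_ids = np.array([5, 42, 7, 7])                   # the same word (id 7) appears twice
wd = embedding_matrix[token_ids] * np.sqrt(d_model)   # word part, scaled as in the paper
pc = sinusoidal_table(max_len=len(token_ids), d_model=d_model)  # position part
encoder_input = wd + pc                               # the pc + wd combination

# Same word at different positions -> different encoder inputs
print(np.allclose(encoder_input[2], encoder_input[3]))  # False
```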

If you come up with another way to combine wd and pc that works better (keeping training time in mind), please share it.


Hello, @Melange-Lf,

I agree with @balaji.ambresh. I think we need to change our way of thinking and be ready to give up what is necessary for a human to understand a word. We humans need the original word, but that does not mean a model needs the original word embedding, too. We can run an experiment to find out whether, at any point, the model will minus out the positional embedding, but we definitely do not program the model to do it or to learn to do it. This means we let the training process decide, and it probably won’t be a clean minus, even if there is a minus at all.

However, gradient descent does differentiate pc and wd when updating (learning) the values of pc, because pc and wd are added, not multiplied. Consider a very simple example: y = w(pc + wd); then \frac{\partial{y}}{\partial{pc}} = w, which has nothing to do with wd, does it? But the update of w is certainly affected by both, even though you may say that they affect each other indirectly through w. :wink:
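If you would like to see this concretely, here is a quick autograd check of that toy example (PyTorch used only as a convenient tool; the scalar values are arbitrary):

```python
import torch

# The toy example above: y = w * (pc + wd), with arbitrary scalar values
w  = torch.tensor(2.0, requires_grad=True)
pc = torch.tensor(0.5, requires_grad=True)
wd = torch.tensor(3.0, requires_grad=True)

y = w * (pc + wd)
y.backward()

print(pc.grad)  # tensor(2.)      = w        -> no direct dependence on wd
print(w.grad)   # tensor(3.5000)  = pc + wd  -> affected by both
```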

Cheers,
Raymond
