Positional Embeddings

When adding a positional component (pc) to our word embeddings, I understand the idea that the pc stays the same across all of our examples, so the model can learn which position comes where. But won’t adding the pc and the word embedding (wd) together result in a loss of the original information in both pc and wd? Adding a pc to wd makes the wd vector point in some other direction, so we can’t tell which word it was anymore, and vice versa.

Is the model somehow learning to ‘minus’ the pc from the combined embedding so that it can get back the original word embedding and still retain the positional information somehow?
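To make the concern concrete, here is a rough sketch of what I mean (a random stand-in for wd and the sinusoidal pc from the original paper; the numbers are only illustrative):

```python
import numpy as np

d_model = 512
rng = np.random.default_rng(0)

# A random stand-in for one word embedding (wd)
wd = rng.normal(size=d_model)

# Sinusoidal positional encoding (pc) for a single position, as in the original paper
def positional_encoding(pos, d_model):
    pc = np.zeros(d_model)
    i = np.arange(d_model // 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pc[0::2] = np.sin(angles)
    pc[1::2] = np.cos(angles)
    return pc

pc = positional_encoding(pos=3, d_model=d_model)
combined = wd + pc  # this is what actually enters the encoder

# The combined vector no longer points exactly where wd pointed
cos = wd @ combined / (np.linalg.norm(wd) * np.linalg.norm(combined))
print(f"cosine(wd, wd + pc) = {cos:.3f}")  # noticeably below 1.0
```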

PS: I couldn’t add the [week4] tag, hence the question is posted in General.

Based on experimentation, the original authors of the Transformer probably found that the model performed well with their approach of adding pc + wd.

One way to think about this is that the model learns about the dataset not just with respect to the word embedding but also based on the word’s position within the sentence.
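To make that concrete, here is a minimal sketch of the pc + wd combination, assuming the sinusoidal version of pc from the paper (the embedding table below is just a random stand-in, and the sqrt(d_model) scaling is the extra detail the paper applies to the embedding layer):

```python
import numpy as np

def sinusoidal_table(max_len, d_model):
    # Fixed table of positional encodings, identical for every training example
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000, 2 * i / d_model)
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

d_model, vocab_size = 512, 10000
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, d_model))  # stand-in for a learned embedding table

token_ids = np.array([5, 42, 7, 7])                   # the same word (id 7) appears twice
wd = embedding_matrix[token_ids] * np.sqrt(d_model)   # word part, scaled as in the paper
pc = sinusoidal_table(max_len=len(token_ids), d_model=d_model)  # position part
encoder_input = wd + pc                               # the pc + wd combination

# Same word at different positions -> different encoder inputs
print(np.allclose(encoder_input[2], encoder_input[3]))  # False
```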

If you come up with another way to combine wd and pc that works better (keeping training time in mind), please share it.


Hello, @Melange-Lf,

I agree with @balaji.ambresh. I think we need to change our way of thinking and be ready to give up what is necessary for a human to understand a word. We humans need the original word, but that does not mean a model needs the original word embedding, too. We can run an experiment to find out whether, at any point, the model will minus out the positional embedding, but we definitely do not program the model to do it or to learn to do it. This means we let the training process decide, and it probably won’t be a clean minus, even if there is a minus at all.

However, gradient descent does differentiate pc and wd when updating (learning) the values of pc, because pc and wd are added, not multiplied. Consider a very simple example: y = w(pc + wd); then \frac{\partial{y}}{\partial{pc}} = w, which has nothing to do with wd, does it? But the update of w is certainly affected by both, even though you may say that they affect each other indirectly through w. :wink:
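If you would like to see this concretely, here is a quick autograd check of that toy example (PyTorch used only as a convenient tool; the scalar values are arbitrary):

```python
import torch

# The toy example above: y = w * (pc + wd), with arbitrary scalar values
w  = torch.tensor(2.0, requires_grad=True)
pc = torch.tensor(0.5, requires_grad=True)
wd = torch.tensor(3.0, requires_grad=True)

y = w * (pc + wd)
y.backward()

print(pc.grad)  # tensor(2.)      = w        -> no direct dependence on wd
print(w.grad)   # tensor(3.5000)  = pc + wd  -> affected by both
```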

Cheers,
Raymond
