Positional encoding in transformer networks (W4) - why adding as opposed to concatenating?

I read the explanation in the assignment, but I am still wondering: instead of "contaminating" the word embedding vectors by adding small sin/cos positional encodings to them, why not feed in the positional information by appending a position index to the word embedding vector instead?

As this value is added in a new dimension, it would not distort the semantic meaning of the word embedding, would it? Or is there a more practical issue that stops this from working?

Hi ryl,

The position of a word has implications for the meaning to be attached to it - for instance, relative importance, emphasis, primacy or recency effects, etc. So it is an integral part of the meaning features to be extracted. Adding positional encodings should be seen as enriching the input embeddings.

As an aside, the positional encodings are kept small so as not to overshadow the other meaning features. If by "position index" you mean raw integers, those would grow too large and drown out the other meaning features. @balaji.ambresh
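A minimal sketch (in NumPy, not taken from the assignment) of the standard sinusoidal encoding from the "Attention Is All You Need" paper makes the magnitude point concrete: every entry is a sine or cosine, so it stays in [-1, 1], while raw integer positions grow without bound.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Sinusoidal positional encodings:
    pe[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(max_len=50, d_model=16)
print(pe.min(), pe.max())    # every entry stays within [-1, 1]
print(np.arange(50).max())   # raw integer indices grow to 49 and beyond
```

So the sinusoidal values stay on the same scale as typical embedding entries, whereas an integer index for token 500 would dominate the sum.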


Thank you so much for helping me with my question. I think I get the general idea, at least very superficially.

I understand that position/ordering is important, e.g. you can turn a statement into a question by reordering a couple of words, but I am still not clear why adding the sin/cos encoding numerically is the best method for this purpose. For example, call the sin/cos position encoding p and the original word embedding e: why do we add them numerically (e + p) rather than concatenating them (e.append(p))? Wouldn't concatenation preserve more information about both the original word embedding and the position?
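To make the comparison concrete, here is a toy sketch (my own illustration, not course code) of the two options for a single token. The key practical difference is dimensionality: addition keeps the model width unchanged, while concatenation doubles the input width, so every downstream weight matrix would have to grow.

```python
import numpy as np

d_model = 8
rng = np.random.default_rng(0)
e = rng.normal(size=d_model)            # toy word embedding

# sinusoidal encoding for, say, position 3
pos = 3
i = np.arange(0, d_model, 2)
angles = pos / np.power(10000.0, i / d_model)
p = np.empty(d_model)
p[0::2] = np.sin(angles)
p[1::2] = np.cos(angles)

added = e + p                           # shape (8,): model width unchanged
concatenated = np.concatenate([e, p])   # shape (16,): every later layer must widen
print(added.shape, concatenated.shape)
```

One common argument for addition: since the first linear layer could in principle learn weights that treat the two halves of a concatenated vector separately, concatenation buys limited extra expressiveness at the cost of a larger model, while addition lets the network learn to disentangle the two signals within the same width.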

Also, I understand that adding the position index as an integer to the word embedding might make the values too large, but appending it would not have the same issue, since there would be separate weights/parameters for the extra dimension.

Apologies if anything I wrote is hard to understand and I will try to explain better what I mean.

Hi again ryl,

The positional encodings do not have any special meaning separate from the words they encode positions for. Only the combination of word embeddings and positional encodings is meaningful.