I read the explanation in the assignment, but I am still wondering: instead of “contaminating” the word embedding vectors by adding small sin/cos positional encodings, why not feed in the positional information by appending a position index to the word embedding vector?
Since this value would live in a new dimension, it would not distort the semantic meaning of the word embedding, would it? Or is there a more practical issue that stops this from working?
The position of a word has implications for the meaning attached to it - for instance, relative importance, emphasis, primacy or recency effects, etc. So it is an integral part of the meaning features that are to be extracted. Adding positional encodings should be seen as enriching the input embeddings.
As an aside, the positional encodings are kept small so that they do not overshadow the other meaning features. With a raw position index you would be adding integers that grow with the sequence length, and those values would be large enough to drown out the other meaning features. @balaji.ambresh
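To make the scale point concrete, here is a quick sketch (just illustrative, using the usual sin/cos formulation rather than the assignment's exact code): the sinusoidal values always stay within [-1, 1], whereas a raw position index grows with the sequence length.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Standard sin/cos positional encoding (Vaswani et al., 2017)."""
    pos = np.arange(seq_len)[:, None]      # (seq_len, 1) positions
    i = np.arange(d_model)[None, :]        # (1, d_model) dimension indices
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    # sin on even dimensions, cos on odd dimensions
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

pe = sinusoidal_encoding(seq_len=50, d_model=16)
print(pe.min(), pe.max())     # sin/cos values always stay within [-1, 1]
print(np.arange(50).max())    # a raw integer index grows with length: 49 here
```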
Thank you so much for helping me with my question. I think I get the general idea, at least very superficially.
I understand that position/ordering matters, e.g. you can turn a statement into a question by reordering a couple of words, but I am not clear on why adding the sin/cos encoding numerically is the best way to convey it. For example, if we generate the sin/cos positional encoding, call it p, and call the original word embedding e, why do we add them numerically (so e + p) rather than concatenating them (so e.append(p))? Wouldn't concatenation preserve more information about both the original word embedding and the position?
Also, I understand that adding the position index as an integer to the word embedding might make the values too large, but appending it would not have the same issue, since there would be separate weights/parameters for the extra dimension.
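To make what I mean concrete, here is a rough sketch (the encoding function, names, and shapes are just illustrative, not the course's implementation):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # same idea as the lecture's sin/cos formula
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

seq_len, d_model = 10, 16
e = 0.1 * np.random.randn(seq_len, d_model)      # stand-in for learned word embeddings
p = sinusoidal_encoding(seq_len, d_model)        # the positional encodings

added = e + p                                    # what the model does: shape stays (10, 16)
concatenated = np.concatenate([e, p], axis=-1)   # what I am asking about: shape doubles to (10, 32)
print(added.shape, concatenated.shape)
```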
Apologies if anything I wrote is hard to understand; I will try to explain what I mean better if needed.
The positional encodings do not have any special meaning separate from the words they are the positional encodings for. Only the combination of word embeddings and positional encodings is meaningful.