One of the new things I learnt from the Transformer is the use of positional encoding. It is quite a fascinating concept, and if the mentors know of any papers or resources that focus on the theoretical (and intuitive) properties of good positional encodings in general, please post them.

I also wonder out loud whether “pos” in the sine/cosine positional encoding used in the Transformer can be meaningfully generalized to non-integer positions, e.g. when your sequence is a time series and each position is annotated with the time the event took place (which is a float).
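
To make the question concrete, here is a minimal sketch (my own, not from the paper's code) of the sinusoidal formula, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), written so that "pos" can be any float. The function name `sinusoidal_encoding`, the small `d_model=8`, and the example timestamps are all assumptions for illustration. Mechanically nothing breaks with float positions; whether the result is useful to a model is exactly the open question, and it probably depends on how the timestamps are scaled.

```python
import numpy as np

def sinusoidal_encoding(positions, d_model=8, base=10000.0):
    """Sinusoidal positional encoding for arbitrary real-valued positions.

    Hypothetical helper: assumes d_model is even. Integer positions give the
    usual Transformer encoding; floats simply land "between" those rows.
    """
    pos = np.asarray(positions, dtype=np.float64)[:, None]   # shape (n, 1)
    i = np.arange(d_model // 2)[None, :]                     # shape (1, d_model/2)
    angles = pos / base ** (2 * i / d_model)                 # shape (n, d_model/2)
    enc = np.empty((pos.shape[0], d_model))
    enc[:, 0::2] = np.sin(angles)   # even dimensions use sine
    enc[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return enc

# Made-up event timestamps (floats) instead of integer indices 0, 1, 2, ...
timestamps = [0.0, 0.5, 1.75, 3.2]
pe = sinusoidal_encoding(timestamps)
print(pe.shape)
```

Each row is one event's encoding. One thing the sketch makes visible: the units of the timestamps matter, since the wavelengths are fixed by `base`, so seconds versus days would give very different encodings for the same series.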

Furthermore, this reminded me (at least superficially) of the Fourier transform. Were the researchers motivated by this?