What is the reason for using both sine (for even indices) and cosine (for odd indices) transformations in the positional encoding? Why not use just one or the other?
A short answer: computing the encoding this way is simply the easiest route to the surface that we (or the "Attention is all you need" authors) want.
To add my intuition:
The sines (plotted for positions all the way to 1000):

The cosines (plotted for positions all the way to 1000):

Sines and cosines interleaved, all the way to 1000:
So the answer to your original question is: it's simply easier to calculate this matrix that way.
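To make that concrete, here is a minimal NumPy sketch of how such a matrix can be built (the dimensions 1000x512 are my own illustrative choice, not from the original post): even-indexed columns get sines, odd-indexed columns get cosines, following the paper's formula.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(n_positions)[:, None]              # shape (n_positions, 1)
    freqs = 10000.0 ** (np.arange(0, d_model, 2) / d_model)  # shape (d_model / 2,)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(positions / freqs)  # even indices: sines
    pe[:, 1::2] = np.cos(positions / freqs)  # odd indices: cosines
    return pe

pe = positional_encoding(1000, 512)
print(pe.shape)  # (1000, 512)
```

Each row is the encoding for one position; interleaving sine and cosine columns is just two strided assignments.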
To add my view on why we need this matrix: if we multiply it by its own transpose, we get this (1000x1000) surface:
In other words, this is the surface on which the embeddings have to "adapt". The green values are the highest, the red ones are the lowest. Attention (without any learned transformation) would make tokens attend mostly to tokens in the same position as themselves, and least to the tokens furthest from them. As explained in the link above, this pattern contains both absolute and relative positional information.
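This surface can be checked numerically. A small self-contained sketch (again assuming 1000 positions and 512 dimensions, my choice for illustration): the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b) makes every entry of the product depend only on the distance i - j, which is exactly why the diagonal (each position with itself) is the highest ridge.

```python
import numpy as np

n_pos, d_model = 1000, 512
pos = np.arange(n_pos)[:, None]
freqs = 10000.0 ** (np.arange(0, d_model, 2) / d_model)
pe = np.zeros((n_pos, d_model))
pe[:, 0::2] = np.sin(pos / freqs)
pe[:, 1::2] = np.cos(pos / freqs)

# Entry (i, j) of this matrix is the dot product PE[i] . PE[j].
gram = pe @ pe.T  # shape (1000, 1000)

# sin(i*w)sin(j*w) + cos(i*w)cos(j*w) = cos((i - j)*w), so each entry
# collapses to sum_k cos((i - j) * w_k): it depends only on i - j.
# Hence the diagonal (i == j) is the maximum of every row:
assert np.allclose(gram.max(axis=1), np.diag(gram))
# ...and the surface is constant along each diagonal (same i - j):
assert np.allclose(gram[0, 5], gram[100, 105])
```

The diagonal value is simply d_model / 2 (every cos(0) term contributes 1), and everything off the diagonal is smaller, which is the green ridge in the plot.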
If I remember correctly, an interesting consequence of this is that the first few features of the embeddings are usually dominated by the positional encoding, while the "meaning" of the tokens tends to be carried by the later-indexed features.