[week 4] Transformer Network - get_angles

Hi @Georgii @JoaoSilva @andrew1 @rogeriovazp @shubhamchhetri @Mrima, I’d like to share my understanding of this topic here. Hope it helps. And @manifest @TMosh, please check whether I have understood it correctly, thanks~

Part 1
Basically, positional encoding (PE) is a matrix that contains each word’s positional information.

The goals of positional encoding can be interpreted as follows:

  1. Represent a word better by adding positional information to its word embedding, because the same word at different positions may carry a different meaning.
  2. Improve computational efficiency: with positional encoding, all input words can be fed to the model at once while their order is still preserved.

So, the key requirement for the PE matrix is that it has to represent the order of the words in the sequence.


Part 2
Having understood the goal and the key requirement of PE, let’s turn to the solution that meets this requirement: encoding positional information with sin & cos.

First, let’s look at the curves of sin & cos:

Basically, the sin & cos model can be written as: sin(wt) & cos(wt)
Comparing it with the PE formula, we can see that:

  • sin(wt) -> sin(w*pos), cos(wt) -> cos(w*pos)
    • t = pos, interpreted as the time step of the curve
    • w = 1 / 10000^(2*i/d), interpreted as the frequency of the curve (d is the embedding dimension)

As i increases, w decreases, so the frequency of the curve gets lower. So, for a different i, the same position produces a different angle, which means a different value after applying sin/cos.
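To make the effect of i on the frequency concrete, here is a minimal sketch. The helper name `frequency` and the value d = 8 are my own assumptions, picked only for illustration:

```python
def frequency(i, d):
    """Frequency w = 1 / 10000^(2*i/d) for index i and embedding dimension d."""
    return 1.0 / (10000 ** (2 * i / d))

d = 8  # assumed embedding dimension, chosen only for illustration
freqs = [frequency(i, d) for i in range(d // 2)]

for i, w in enumerate(freqs):
    print(f"i = {i}: w = {w:.6f}")
```

With d = 8 this prints w = 1.000000, 0.100000, 0.010000 and 0.001000 for i = 0, 1, 2, 3, so each step in i lowers the frequency by a factor of 10.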

In other words, for each element of each position, sin/cos generates a value that depends on both the position and the element index, and this value can be used to represent the positional information of the word in the sequence. This is the reason why sin/cos is chosen to represent positional encoding.

For a more elaborate proof of why sin and cos are used, please check this article.


Part 3

By now, we have the following two formulas:
PE(pos, 2i) = sin(pos / 10000^(2*i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2*i/d))

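If we change the formulas a little bit, so that i is simply the column index, it may reduce the confusion around “2i” and “2i + 1”:
angle(pos, i) = pos / 10000^(2*(i // 2)/d)
PE(pos, i) = sin(angle(pos, i)) if i is even
PE(pos, i) = cos(angle(pos, i)) if i is odd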

And now we come to the reason for “i // 2”:

  • i = 0 → i // 2 = 0
  • i = 1 → i // 2 = 0 → angle_0 = angle_1
  • i = 2 → i // 2 = 1
  • i = 3 → i // 2 = 1 → angle_2 = angle_3
  • i = 4 → i // 2 = 2
  • i = 5 → i // 2 = 2 → angle_4 = angle_5

It ensures that each (even, odd) pair gets the same angle; the only difference between them is that sin is applied to the even index and cos is applied to the odd index. This is also the reason why we use both sin & cos.
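The pairing can be checked numerically. Below is a rough sketch of what a `get_angles` helper could look like in NumPy; the argument names, the sizes (4 positions, d = 8) and the use of broadcasting are my assumptions, not necessarily the exact code from the assignment:

```python
import numpy as np

def get_angles(pos, i, d):
    """Angle pos / 10000^(2*(i//2)/d) for position(s) pos and column index(es) i."""
    return pos / np.power(10000, (2 * (i // 2)) / d)

d = 8                                # assumed embedding dimension
pos = np.arange(4)[:, np.newaxis]    # positions 0..3, shape (4, 1)
i = np.arange(d)[np.newaxis, :]      # column indices 0..7, shape (1, 8)
angles = get_angles(pos, i, d)       # broadcasts to shape (4, 8)

print(i // 2)        # [[0 0 1 1 2 2 3 3]]
print(angles[1])     # row for pos = 1: each (even, odd) pair shares one angle
```

For pos = 1 the row is approximately 1, 1, 0.1, 0.1, 0.01, 0.01, 0.001, 0.001, so columns 0 and 1 share an angle, columns 2 and 3 share the next one, and so on.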

For ease of understanding, we can interpret each (even, odd) pair as one unit, and “i // 2” is written to implement this.

And the reason we have (even, odd) pairs is that this provides a richer representation of positional information, even for long sequences. And this is exactly the goal of positional encoding.
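Putting the pieces together, here is a minimal sketch of the whole PE matrix in NumPy. The function name `positional_encoding` and the sizes (50 positions, d = 8) are assumptions for illustration only:

```python
import numpy as np

def positional_encoding(num_positions, d):
    """Build a (num_positions, d) matrix: sin on even columns, cos on odd columns."""
    pos = np.arange(num_positions)[:, np.newaxis]        # shape (num_positions, 1)
    i = np.arange(d)[np.newaxis, :]                      # shape (1, d)
    angles = pos / np.power(10000, (2 * (i // 2)) / d)   # each pair shares one angle

    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even columns -> sin
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd columns  -> cos
    return pe

pe = positional_encoding(num_positions=50, d=8)
print(pe.shape)   # (50, 8)
print(pe[0])      # position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
```

One way to see why the pair is richer than sin alone: sin(x) and sin(π - x) are equal, so a single sin value cannot tell those two angles apart, while the (sin, cos) pair together pins the angle down up to a multiple of 2π.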

That’s it.
Hope it helps, and discussion is welcome.
