[week 4] Transformer Network - get_angles

I have this issue: “Submatrices of odd and even columns must be equal”.

Code in “get_angles” function:

{mentor edit: code removed}

It looks like it computes the values and keeps the dimensions right (even without “.astype(np.float)”).

4 Likes

Solved. The fix was to use the correct exponentiation:

{mentor edit: code removed}

26 Likes

Had the same issue, and yes: your formula works (with np.power).

But I’ll admit: I’m not seeing why we multiply 2 by (i // 2)… Supposedly, the formula is just 2*i.
I’m a bit puzzled…

11 Likes

Me too. I did not see the reason behind this.

1 Like

Wow, nice. I have been looking for a one-line solution for this. Thanks! For folks who are a bit confused: the i here is the numpy.ndarray, not the i in the formula. So, given

[[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]]

say 2 // 2 is 1 and 3 // 2 is 1; then you have

[[0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7]]

A plain Python list does not support this operator, I think.
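A minimal sketch of the point above, assuming NumPy is available. The variable names are only illustrative and are not taken from the assignment.

```python
import numpy as np

# Column indices 0..15 as a row vector (a 2-D ndarray of shape (1, 16)).
i = np.arange(16)[np.newaxis, :]

# Floor division is applied element-wise on an ndarray.
halved = i // 2
print(halved)  # [[0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7]]

# A plain Python list does not define `//`, so this raises TypeError.
try:
    list(range(16)) // 2
    list_supports_floordiv = True
except TypeError:
    list_supports_floordiv = False
print(list_supports_floordiv)  # False
```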

2 Likes

Can someone explain more about this? Sorry, I still do not understand why i should be divided by 2.

3 Likes

I have an issue: in the following, how do I get a value for i to pass to the get_angles method?

    angle_rads = get_angles(None,
                        None,
                        None)
1 Like

Hey everybody :slightly_smiling_face:

In the assignment, the function get_angles(...) initializes a matrix of all possible values for the sine and cosine functions that we use later to calculate positional encoding.

In this video, Andrew explains that each sine curve is matched with a cosine curve, which means both curves should get the same angle values. In our matrix initialized by get_angles(...), the values for each curve are stored in columns, and i specifies the column index of each curve. To make the values the same between neighboring columns, we use i // 2 in our implementation.

@andrew1 @rogeriovazp

8 Likes

Hi @dibyendu ,

As far as I can see, the angle_rads variable is only used in the positional_encoding(...) function. The function is already implemented for you. You shouldn’t change its code.

If it is not implemented, try refreshing the notebook – you may have an obsolete version of it.

2 Likes

@dibyendu, it seems that I’ve implemented the function and forgot about that. You don’t need to reset the notebook – my bad :slightly_smiling_face:

As you can see from the function definition, you need

  • a column vector of positions for its first argument; np.arange(vector_dimension)[:, np.newaxis] may come in handy.
  • a row vector of our column indices; np.arange may also be used for that purpose.
  • an integer for the last argument: the dimension of the inputs/embeddings.
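To illustrate the shapes described in the list above, here is a small sketch. The sizes (4 positions, dimension 8) are hypothetical values chosen for the example, not the ones used in the notebook.

```python
import numpy as np

positions = 4  # hypothetical number of positions
d = 8          # hypothetical encoding dimension

pos = np.arange(positions)[:, np.newaxis]  # column vector, shape (4, 1)
i = np.arange(d)[np.newaxis, :]            # row vector, shape (1, 8)

# Broadcasting a (4, 1) array against a (1, 8) array produces a (4, 8) grid,
# one row per position and one column per encoding dimension.
grid_shape = (pos * i).shape
print(pos.shape, i.shape, grid_shape)  # (4, 1) (1, 8) (4, 8)
```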

3 Likes

The previous answer to this by @manifest was very unclear and I am also finding it hard to understand why we use i//2. Kindly please explain this in more simple terms.

1 Like

To compute position encoding vectors, in the assignment:

  • We initialize a matrix of all possible values for the sine and cosine functions: N = get_angles(...).

    The matrix N has shape pos x i, where pos is a position (or time step) and i indexes the dimensions of the position encoding (e.g. i = {1,2,...,n}).

  • Later we use the values in the matrix N to calculate the actual values of each position encoding, passing values from N as arguments to the sine and cosine functions.

    Each row of the matrix N provides the values required to calculate the position encoding for the position pos.

    The value of the position encoding vector at index i corresponds to a point on the sinusoid curve for odd indices i in {1,3,5,...} and a point on the matched cosine curve for even indices i in {2,4,6,...}.
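The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the sizes are made up, the angle formula is the standard one from the original Transformer paper, and the indices are 0-based, so the sine columns are 0, 2, 4, ... (the "odd" indices 1, 3, 5, ... in the 1-based notation used above).

```python
import numpy as np

positions, d = 5, 8  # hypothetical sizes for the example

pos = np.arange(positions)[:, np.newaxis]  # shape (5, 1)
k = np.arange(d)[np.newaxis, :]            # shape (1, 8), 0-based column index

# Step 1: the angle matrix N (neighboring columns share the same angle).
angle_rads = pos / np.power(10000, (2 * (k // 2)) / d)

# Step 2: sine on 0-based even columns, cosine on 0-based odd columns.
pe = np.zeros_like(angle_rads)
pe[:, 0::2] = np.sin(angle_rads[:, 0::2])
pe[:, 1::2] = np.cos(angle_rads[:, 1::2])

print(pe.shape)  # (5, 8)
```

At position 0 every angle is 0, so the first row alternates between sin(0) = 0 and cos(0) = 1.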

@shubhamchhetri I hope this makes things clear.

6 Likes

was able to proceed but I am kind of stuck at this place:

Not sure if I am doing something wrong!

# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION DecoderLayer
{mentor edit; code removed}
1 Like

Okay, I had one line wrong; now it works for me.

{mentor edit: code removed}

Same for me. I am still confused about why the formula was changed.

I just want to chime in and say that’s not at all apparent from the exercise instructions. The equation we’re told to implement uses 2i, but the expected output requires you to use 2*(i//2).

This is obviously confusing. I think the instructions need to be clearer.

9 Likes

Hi
This is obviously creating problems; maybe the instructions need to be clearer.
For anyone else like me who came across this problem, this is the formula you need to look at:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
For now, just look at the term with ‘i’.
For 2i on the left, we have 2i on the right side,
but for 2i + 1 too, we have 2i on the right side.
The column index passed to the function can be either 2i or 2i + 1 (even or odd), but the exponent must use 2i (the even value) even when the index is 2i + 1.
When you apply ‘// 2’ and then multiply by 2, 2i stays the same but 2i + 1 becomes 2i; that is, 2 * ((2i + 1) // 2) == 2i.
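A quick check of that mapping, as a sketch with an illustrative index array k. Note that the parentheses matter: in Python, 2 * k // 2 is evaluated as (2 * k) // 2, which just gives k back.

```python
import numpy as np

k = np.arange(8)  # column indices 0..7

paired = 2 * (k // 2)  # floor-divide first, then double
unpaired = 2 * k // 2  # evaluated as (2 * k) // 2, i.e. k itself

print(paired)    # [0 0 2 2 4 4 6 6]
print(unpaired)  # [0 1 2 3 4 5 6 7]
```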
Hope it helps
Jimmy

17 Likes

Thanks! You answered the question of why using “i // 2” here.

The main problem comes from the notation of i. Initially, I assumed “i” was already divided by 2.

Hi @Georgii @JoaoSilva @andrew1 @rogeriovazp @shubhamchhetri @Mrima , I'd like to share my understanding of this topic here. Hope it helps. And @manifest @TMosh , please check whether I understand it correctly, thanks~

Part 1
Basically, positional encoding is a matrix containing each word’s positional information.

The goals of positional encoding can be interpreted as follows:

  1. Help to represent a word better by adding positional information to its word embedding, because the same word at a different position may have a different meaning.
  2. Help to improve computational efficiency, because with positional encoding all input words can be fed to the model at once.

So, the key requirement for the PE matrix is that it has to represent the order of the words in the sequence.


Part 2
After understanding the goal and key requirement of PE, let’s turn to the solution that meets it: encoding positional information with a sin & cos model.

First, let’s look at the curves of sin & cos:

Basically, the sin & cos model can be represented as: sin(wt) & cos(wt)
Comparing it with PE’s representation, we can see that:

  • sin(wt) -> sin(w*pos), cos(wt) -> cos(w*pos)
    • t = pos, interpreted as the time step of the curve
    • w = 1 / 10000^(2*i/d), interpreted as the frequency of the curve

As i increases, w decreases, and the frequency of the curve gets lower. So each i produces a different angle for a given position, and therefore a different value after applying sin/cos.

In other words, each element at each position gets a different value from sin/cos, which can be used to represent the positional information of a word in the sequence. This is why sin/cos is chosen to represent the positional encoding.
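To make the decreasing frequency concrete, here is a small sketch that evaluates w = 1 / 10000^(2*i/d) for a hypothetical d = 8, where i runs over the formula's pair index 0..3.

```python
import numpy as np

d = 8                  # hypothetical encoding dimension
i = np.arange(d // 2)  # the formula's i: 0, 1, 2, 3

# Frequency of each sin/cos pair.
w = 1.0 / np.power(10000.0, 2 * i / d)

# With d = 8 the exponents are 0, 0.25, 0.5, 0.75,
# so the frequencies are approximately 1, 0.1, 0.01, 0.001.
print(w)
```

Each step in i lowers the frequency by a factor of 10 in this example, so later dimensions change much more slowly with pos than earlier ones.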

For a more elaborate proof of why sin and cos are used, please check this article.


Part 3

By now, we have the following two formulas:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

If we change the formulas a little bit, it may reduce the confusion of “2i” and “2i + 1” :
pe

And now we come to the reason for “i // 2”:

  • i = 0 → i // 2 = 0
  • i = 1 → i // 2 = 0 → angle_0 = angle_1
  • i = 2 → i // 2 = 1
  • i = 3 → i // 2 = 1 → angle_2 = angle_3
  • i = 4 → i // 2 = 2
  • i = 5 → i // 2 = 2 → angle_4 = angle_5

It ensures that each (even, odd) pair gets the same angle; the only difference between them is that the even index is passed to sin and the odd index to cos. This is also why we use both sin & cos.
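This pairing is what the “Submatrices of odd and even columns must be equal” check is about. The sketch below uses a hypothetical helper, angles_sketch, with made-up sizes; it is an illustration of the idea and not the notebook's reference solution.

```python
import numpy as np

def angles_sketch(pos, k, d):
    """Hypothetical helper: angle for each position and 0-based column index k."""
    return pos / np.power(10000, (2 * (k // 2)) / d)

pos = np.arange(4)[:, np.newaxis]  # shape (4, 1)
k = np.arange(8)[np.newaxis, :]    # shape (1, 8)

angles = angles_sketch(pos, k, 8)

# Each (even, odd) column pair shares one angle.
pairs_match = np.allclose(angles[:, 0::2], angles[:, 1::2])

# Using 2 * k directly (no floor division) breaks the pairing.
naive = pos / np.power(10000, (2 * k) / 8)
naive_pairs_match = np.allclose(naive[:, 0::2], naive[:, 1::2])

print(pairs_match, naive_pairs_match)  # True False
```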

For the sake of understanding, we can interpret each (even, odd) pair as one unit, and “i // 2” is written to implement that.

The reason we have (even, odd) pairs is that they provide a richer representation of positional information, even for longer sequences. This is exactly the goal of positional encoding.

That’s it.
Hope it helps; discussion is welcome.

19 Likes

Hey @Damon,

Thank you for posting this. I think it’s a really great step-by-step explanation.

For the 3rd part, I guess the actual motivation for using the cos and sin functions comes from the derivation explained here.

There’s also a colab with a position encoding visualization from Jalammar’s post about transformer networks. You can see how the values at each dimension of the position vector change for different positions in a sequence. Try changing the position encoding generation function and see the results.

1 Like