def positional_encoding(positions, d):
    """
    Precomputes a matrix with all the positional encodings

    Arguments:
        positions (int) -- Maximum number of positions to be encoded
        d (int) -- Encoding size

    Returns:
        pos_encoding -- (1, position, d_model) A matrix with the positional encodings
    """
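For reference, here is a minimal numpy sketch of the standard sine/cosine encoding from the "Attention Is All You Need" paper. This is an illustration of the general technique, not necessarily the assignment's exact solution:

```python
import numpy as np

def positional_encoding(positions, d):
    """Sketch of the sine/cosine positional encoding from the
    Transformer paper; details may differ from the assignment."""
    pos = np.arange(positions)[:, np.newaxis]      # (positions, 1)
    i = np.arange(d)[np.newaxis, :]                # (1, d)
    # each pair of dimensions (2k, 2k+1) shares one frequency
    angle_rates = 1 / np.power(10000.0, (2 * (i // 2)) / d)
    angle_rads = pos * angle_rates                 # (positions, d)
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])  # even dims: sin
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])  # odd dims: cos
    return angle_rads[np.newaxis, ...]             # (1, positions, d)
```

Note the returned shape `(1, positions, d)`: the leading 1 is a batch dimension that lets the table broadcast against a batch of sequences.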

The parameter ‘d’ is the encoding size, which should be the same as the embedding dimension, i.e. 300. But what about positions? Why is it “Maximum number of positions to be encoded”? The number of positions should equal the number of tokens, which is a fixed number for a given sequence. I don’t understand why it is called ‘maximum’.

If we go to Exercise 5, where positional_encoding is called to create the encodings, we can see that the same set of created encodings is reused again and again in call(). We are actually asked to use it this way (it is part of the exercise solution, so I will not go into that). When you do the exercise, you will find that if we had not created enough encodings, we would run out of them whenever the sequence length (or the number of tokens, as you said) is larger than the “Maximum number of positions to be encoded”.

Let’s ask ourselves these questions.

What will happen if we set it to 1? It’s not going to be enough for any sequence longer than 1 token.

What will happen if we set it to an unreasonably large number, say 1000000? It’s not going to use all of them. If my sequence has 10 tokens, then it only uses 10 of them.
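The "only uses 10 of them" point is just array slicing. A tiny illustration (the table here is a random stand-in for the real encodings, and the sizes are kept small so it runs quickly):

```python
import numpy as np

max_positions, d = 50, 4                  # stand-in for an over-generous maximum
pos_encoding = np.random.rand(1, max_positions, d)  # pretend precomputed table

seq_len = 10                              # this sequence has 10 tokens
used = pos_encoding[:, :seq_len, :]       # only the first 10 positions are used
print(used.shape)                         # the rest of the table sits unused
```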

Cheers,
Raymond

PS: this is the C5 Week 4 Assignment. Please mention it next time

I still have no clue about the concept of the maximum number of positions. It may be related to the fact that I don’t understand this line of code:

{code removed by mentor}

Why is it not just like this: x += self.pos_encoding

Also, what do you mean by "the same set of created encodings is going to be reused again and again in the call()"? By ‘again and again’, do you mean it has to be called once for each training example x? By “the same set of created encodings”, do you mean that because the parameter ‘seq_len’ is the same, the positional encoding will always be the same for each call?

One more question: I checked the shapes of x and self.pos_encoding in the Encoder:

The above line tells us that, in principle, the seq_len can change from call() to call(). Right? If it did not change, then we could just set it to a constant. If it can change, then it can equal 5 this time and 10 next time. Right?

We know the positional encodings need to be generated. Now we need to ask ourselves one question: do we generate them once and for all, or once per call()?

In the assignment, we are generating it once and for all, because we store it in the variable self.pos_encoding so that we can retrieve it again and again in every call(), right? We have put it there, and so we can reuse it. It’s just like assigning a value of 5 to x by x = 5, and then reusing it by referring to x over and over again. Right? This is why we store something in a variable: so we can reuse it.

Now, why don’t we generate it once per call()? The obvious reason is that generating it once and for all saves time; the generation process takes time.

If we want to generate it once and for all, how many positional encodings do we need? We need to generate a sufficient number of them, right? Then we need to ask ourselves: at maximum, how many positions can I use? That number is the maximum number of positions to be encoded.

It is just like this: I need to generate a set of positional encodings for us to use forever, so I ask you, “Hey, how many positions can you use? Just tell me, at most, how many you will need, so that I don’t have to generate more when I find it isn’t enough.” You tell me, “Usually 20.” I ask, “Really? Do you not have a sentence that could possibly be 30?” You say, “Unlikely, but possible.” Then I say, “OK, I will give you 50 then, just to make sure it is enough forever.” 50 is the maximum number of positions to be encoded. Get it?
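The "generate once, reuse in every call()" pattern can be sketched like this (class and attribute names mirror the assignment's structure, but the zero table is just a placeholder for real encodings):

```python
import numpy as np

class TinyEncoder:
    """Hypothetical sketch: build the encoding table once in __init__,
    then slice it in every call()."""

    def __init__(self, maximum_position_encoding, d):
        # generated once and for all, stored for reuse
        self.pos_encoding = np.zeros((1, maximum_position_encoding, d))

    def call(self, x):
        # x has shape (batch, seq_len, d); seq_len may differ per call
        seq_len = x.shape[1]
        # reuse the stored table, taking only the first seq_len positions
        return x + self.pos_encoding[:, :seq_len, :]

enc = TinyEncoder(50, 4)
out = enc.call(np.ones((2, 10, 4)))   # a batch of 2 sequences, 10 tokens each
```

Calling `enc.call` again with a 5-token batch works with the same stored table, which is the whole point of sizing it by the maximum.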

I didn’t check the shape. But if you ask me how TensorFlow adds two arrays of different shapes, then “broadcasting” is the topic you need to read about. Here is the broadcasting explanation by TensorFlow, and you will find out under what circumstances you can add two tensors of different shapes. Broadcasting is also adopted in numpy, so reading it will be useful for both packages.

After you read about broadcasting, you will see that x += self.pos_encoding will fail.
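You can see the failure for yourself with numpy (same broadcasting rules as TensorFlow for this case); the shapes below are illustrative:

```python
import numpy as np

x = np.ones((2, 10, 4))             # (batch, seq_len, d) with seq_len = 10
pos_encoding = np.ones((1, 50, 4))  # (1, maximum_positions, d)

try:
    x + pos_encoding                # 10 vs 50 in axis 1: not broadcastable
except ValueError as e:
    print("fails:", e)

ok = x + pos_encoding[:, :10, :]    # slice to (1, 10, 4) first, then it broadcasts
```

After slicing, the shapes are (2, 10, 4) and (1, 10, 4); the leading 1 broadcasts over the batch, so the addition succeeds.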

"We want to generate it once and for all": So positional encoding has nothing to do with the words or meanings in a sentence; only position matters, and that’s why one set of positional encodings can be used for all different sentences? And that’s why we use self.pos_encoding[:, :seq_len, :] to get the pos_encoding for the current sentence, taking up to ‘seq_len’ positions along the 2nd dimension? This is different from the input embedding, which depends on the words in a sentence.

Yes, positional encoding is just about encoding the position of each word in a sentence. It is not responsible for the meaning of a word. And yes, if seq_len = 10, then we only need the 0th to the 9th positional encodings to represent the sequence’s word positions.