Understanding of basic Attention code

Hey @someone555777,
Let’s start with this query first, and once it is resolved, we will move to your other query.

Let me attach the diagram of the alignment model for our reference here.

[Image: alignment_model_3 — diagram of the alignment model]

Also, note some variables here:

```python
hidden_size = 16
attention_size = 10
input_length = 5
```
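For concreteness, here is a minimal NumPy setup matching those sizes. The random values below are just placeholders of mine so that the shapes are concrete; in the actual lab, the states come from the encoder/decoder and the layer weights are learned:

```python
import numpy as np

np.random.seed(42)

# Placeholder tensors with the shapes discussed below (random values,
# standing in for the real encoder/decoder outputs and trained weights)
encoder_states = np.random.randn(input_length, hidden_size)  # (5, 16)
decoder_state = np.random.randn(1, hidden_size)              # (1, 16)
layer_1 = np.random.randn(2 * hidden_size, attention_size)   # (32, 10)
layer_2 = np.random.randn(attention_size, 1)                 # (10, 1)
```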

Let’s decode this! Yes, we have the decoder hidden state corresponding to the last produced token, which has a shape of (1, 16). And yes, we have the encoder hidden states corresponding to each of the input tokens, which together have a shape of (5, 16). Now, we simply repeat the decoder hidden state 5 times, so that it also has a shape of (5, 16), and concatenate each of its copies with the corresponding encoder hidden state, giving an output of shape (5, 32), which is denoted by the variable inputs. I am leaving out “horizontal” and “vertical” here, since I am always confused by those terms :joy:
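In code, assuming the NumPy setup above, that repeat-and-concatenate step looks like this (the concatenation order is just a convention; it only has to match how layer_1 was trained):

```python
# Repeat the (1, 16) decoder state once per input token -> (5, 16)
repeated_state = np.repeat(decoder_state, input_length, axis=0)

# Concatenate along the feature axis: (5, 16) and (5, 16) -> (5, 32)
inputs = np.concatenate([encoder_states, repeated_state], axis=1)
print(inputs.shape)  # (5, 32)
```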


Now, we pass the inputs through a dense layer, taking one step further towards computing the scores. The weights of this dense layer are denoted by the variable layer_1. I believe you can easily answer what the weights are doing here now; if not, you may want to go through the first course once again. So, activations is simply the output of this first dense layer. Here, inputs has a shape of (5, 32) and layer_1 has a shape of (32, 10), therefore activations has a shape of (5, 10).
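As a sketch, continuing from the snippet above, the dense layer is just a matrix multiplication with layer_1, followed by the tanh non-linearity discussed next:

```python
# (5, 32) @ (32, 10) -> (5, 10), then tanh squashes each value into (-1, 1)
activations = np.tanh(np.matmul(inputs, layer_1))
print(activations.shape)  # (5, 10)
```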


First of all, it’s important to note here that correlation doesn’t measure similarity, and neither does causation. Read this thread if you are confused about this.

Now, our aim is to measure the similarity between the encoder hidden states and the current decoder hidden state, which is represented by the alignment scores, denoted by the variable scores. As to why we are using tanh, I don’t believe I have a strong reason for that. What I can tell you is that since this is a feed-forward network, we do have to use some non-linear activation function, and when we are dealing with similarity values, we often use tanh, since it helps us distinguish between similar values (close to 1) and dissimilar values (close to -1). Feel free to use other non-linear activation functions if they give you better results.
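Continuing the sketch, the second dense layer (layer_2) then collapses the 10 activation values per word into a single alignment score per word:

```python
# (5, 10) @ (10, 1) -> (5, 1): one raw (unnormalized) score per input token
scores = np.matmul(activations, layer_2)
print(scores.shape)  # (5, 1)
```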


First of all, I don’t think “one of words embeddings of each word” makes any sense. Each word has only 1 word embedding, unless we switch the set of word embeddings for the entire vocab. The shape of scores is (5, 1), i.e., it contains one score per input word, by which we weight the corresponding encoder hidden state.


To some extent, you are on the right track. layer_2 produces the scores, which help us decide which words’ encoder hidden state representations are more useful, and which are less. I don’t think the endgame is near :joy: But if you want to take a look at the end, the scores help us decide which of the encoder’s hidden state representations are more useful for computing the context vector, which in turn helps us decide the next word.


I am a little unsure as to what you are trying to say here, but let me present what I believe you mean. We take the softmax over the scores to normalize them, and store these normalized scores in the variable weights. We use these weights to take a weighted sum of the encoder_states, which forms our context vector, denoted by the variable context.
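In code, continuing from above (the softmax helper here is my own minimal implementation, not necessarily the one the lab provides):

```python
def softmax(x, axis=0):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

weights = softmax(scores, axis=0)  # (5, 1), non-negative, sums to 1

# Weight each encoder state by its score and sum over the 5 tokens
weighted_states = encoder_states * weights  # broadcasting: (5, 16)
context = np.sum(weighted_states, axis=0)   # (16,) context vector
```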

Let me take a pause here. Please do let me know if anything is unclear up to this point. If not, we will proceed further.

Cheers,
Elemento