Understanding basic attention code

Hi! If I understand correctly, this is one of the most important lines from the ungraded lab:
activations = np.tanh(np.matmul(inputs, layer_1))

Can you explain to me what happens?
Ok, so we concatenated, along the horizontal axis, the encoder hidden states with the last hidden state of the decoder (which contains parts of the computations of the decoder hidden states before it).
Ok, I understand that we have some weights at this step (layer_1), but what exactly do they control? As I understand it, they determine which of the encoder hidden states are correlated with the decoder hidden state at all? Is that why we use tanh? So a negative number means not correlated, and a positive one means correlated.

And what does the scores computation do? Judging by its shape, it is connected with only one of the embedding features of each word? So do the layer_2 weights help us understand which embedding features are most important for computing the next word? And in the end we get a distribution over the words that have the biggest influence on the previous decoder hidden state?

And after this we compute the attention itself, using softmax to determine which of the returned weights are most important? And with the help of the computed weights we can scale down the encoder word embeddings a bit and end up with a synthetic context variable that contains the most probable embedding of the next word?

And a few additional questions.

  1. Do I understand correctly that in this lab we compute the alignment and attention only for one of the predicted words in the translation (decoder_state)? But usually we need to compute this attention for each next word. So there will be multiple calls of attention(encoder_states, decoder_state), where decoder_state will be different (different output words), but encoder_states stay the same.
  2. We have hidden_size. If I understand correctly, it is the word embedding size of one word. And after this we transform it through a linear layer and tanh. We get activations with attention_size columns, which contain a classification. But what is it exactly? How have we got a 5x10 matrix from 5 words with a 16x2 word embedding size? And why didn’t we multiply 5 by 2, by the way?
  3. As I understand it, we need tanh to strictly determine which words in the input sentence are connected with the generated output. And we get approximate numbers from the alignment() func. But after this we do softmax… Why haven’t we done softmax right after computing the activations in alignment()? It looks like I don’t fully understand the purpose of alignment() at all.

Hey @someone555777,
Let’s start with this query first, and once it is resolved, we will move on to your other queries.

Let me attach the diagram of the alignment model for our reference here.


Also, note some variables here:

hidden_size = 16
attention_size = 10
input_length = 5

Let’s decode this! Yes, we have the decoder hidden state corresponding to the last produced token, which has a size of (1, 16). And yes, we have the encoder hidden states corresponding to each of the input tokens, which have a size of (5, 16). Now, we simply repeat the decoder hidden state 5 times, so that it also has a shape of (5, 16), and concatenate each of its copies with the corresponding encoder hidden state, and we get an output of shape (5, 32), which is represented by inputs. I am leaving out “horizontal” and “vertical” here, since I am always confused by those terms :joy:
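If it helps, here is a minimal NumPy sketch of that concatenation step, using the shapes from above (the random arrays are just stand-ins for the real encoder/decoder states):

```python
import numpy as np

hidden_size = 16
input_length = 5

# Toy stand-ins for the actual hidden states (random values, real shapes)
encoder_states = np.random.randn(input_length, hidden_size)  # (5, 16)
decoder_state = np.random.randn(1, hidden_size)              # (1, 16)

# Repeat the decoder state once per input token, then concatenate
# along the feature axis (axis=1), not the token axis (axis=0)
inputs = np.concatenate(
    [np.repeat(decoder_state, input_length, axis=0), encoder_states],
    axis=1,
)
print(inputs.shape)  # (5, 32)
```

So the concatenation doubles the number of features per row while keeping one row per input token.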

Now, we pass the inputs through a dense layer, to take one step further towards computing the scores. The weights representing the dense layer are denoted by the variable layer_1. I believe you can easily answer what the weights are doing here now. If not, I believe you need to go through the first course once again. So, activations is simply the output of the first dense layer. Here, inputs has a shape of (5, 32) and layer_1 has a shape of (32, 10), therefore activations has a shape of (5, 10).
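As a quick sanity check on those shapes, here is that dense layer in plain NumPy, with random weights standing in for the trained layer_1:

```python
import numpy as np

inputs = np.random.randn(5, 32)    # concatenated states from the previous step
layer_1 = np.random.randn(32, 10)  # dense-layer weights: 32 features -> 10 units

# (5, 32) @ (32, 10) -> (5, 10), then tanh squashes every entry into [-1, 1]
activations = np.tanh(np.matmul(inputs, layer_1))
print(activations.shape)  # (5, 10)
```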

First of all, it’s important to note here that correlation doesn’t measure similarity, nor does it imply causation. Read this thread if you are confused about this.

Now, our aim is to measure the similarity between the encoder hidden states and the current decoder hidden state, which is represented by the alignment scores, denoted by the variable scores. As to why we are using tanh, I don’t believe I have a strong reasoning for that. What I can tell you is that since it is a feed-forward network, we do have to use some non-linear activation function, and whenever we are dealing with similarity values, we often use tanh, since it helps us to separate similar values (close to 1) from dissimilar values (close to -1). Feel free to use other non-linear activation functions if they give you better results.

First of all, I don’t think “one of words embeddings of each word” makes any sense. Each word has only one word embedding, unless and until we switch the set of word embeddings for the entire vocab. The shape of scores is (5, 1), i.e., it contains the score for each of the words, by which we should weight the corresponding encoder hidden states.

To some extent, you are on the right track. layer_2 produces scores which help us to decide which word’s corresponding encoder hidden-state representation is more useful, and which is less. I don’t think the endgame is near :joy: But if you want to take a look at the end, the scores help us to decide which of the encoder’s hidden-state representations are more useful for computing the context vector, which in turn helps us to decide the next word.
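Continuing the same kind of sketch, layer_2 simply collapses the 10 activation features into a single score per token (random weights again, just to illustrate the shapes):

```python
import numpy as np

activations = np.tanh(np.random.randn(5, 10))  # output of the first dense layer
layer_2 = np.random.randn(10, 1)               # second layer: 10 units -> 1 score

# One unnormalized alignment score per input token
scores = np.matmul(activations, layer_2)       # (5, 10) @ (10, 1) -> (5, 1)
print(scores.shape)  # (5, 1)
```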

I am a little unsure as to what you are trying to say here. But let me present what I believe you are trying to say. We take the softmax over the scores to normalize them, and store these scores in the variable weights. We use these weights to take a weighted sum of the encoder_states, which forms our context vector, denoted by the variable context.
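Putting those last two steps into the same sketch (again with random stand-ins for the real arrays), the softmax and the weighted sum look like this:

```python
import numpy as np

encoder_states = np.random.randn(5, 16)  # one hidden state per input token
scores = np.random.randn(5, 1)           # unnormalized alignment scores

# Softmax over the 5 scores -> non-negative weights that sum to 1
weights = np.exp(scores) / np.sum(np.exp(scores))  # (5, 1)

# Scale each encoder state by its weight (broadcasting), then sum over tokens
weighted_scores = encoder_states * weights         # (5, 16)
context = np.sum(weighted_scores, axis=0)          # (16,)
print(context.shape)  # (16,)
```

The weighted sum collapses the 5 token rows into a single 16-dimensional context vector.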

Let me take a pause here. Please do let me know if you find any confusion up until this point. If not, we will proceed further.


First of all, I would like to clarify: we do all these steps just to get the context that will help to predict the current decoder word, right? But we haven’t learned that step (computing the current decoder embeddings) at all. Is it because this type of attention is not used in practice? As I understand it, it should be something like applying this softmax output (the attention func) over the vocabulary, or something like that. As I see, we started learning dot-product attention immediately after this lab.

Okay, so this operation just adds (concatenates) the same embeddings of the previous decoder state to each of the encoder embeddings, and we get twice as many embedding features with the same number of tokens (rows in the matrix).

I don’t quite understand why you put so much emphasis on this.

Please explain this point to me a bit more clearly. Maybe you just meant classification functions?

How so? What is hidden_size in this case? Isn’t it the length of a word embedding’s feature vector?

Oh, nice. Thank you very much! That sheds a bit of light. So this is one more learnable linear function in the training that tries to understand which similarities are useful. Looks interesting, but why can’t we just sort the tanh similarities in descending order?

Yes, it was a bit of a mess on my side. One of the unanswered questions was this: why did we need layer_2? Why haven’t we used softmax directly on the tanh output? And why can’t this tanh output be the weights by itself?

Also, can you explain to me a bit the line context = np.sum(weighted_scores, axis=0)? Is it something like a document embedding built from the embeddings of the words it contains, which we created earlier in the courses?

Also, I would like to know how all of this is connected with the attention model that was described in the Deep Learning course. Was that the BERT attention model?