Attention Weights Issue in C5 W3 Neural Machine Translation Assignment (Section 3.1)

In the optional part of the C5 W3 Neural Machine Translation assignment, specifically in section 3.1 - Getting the Attention Weights From the Network, I’m encountering an issue where the outputted attention values are completely incorrect. The condition \sum_{t'} \alpha^{\langle t, t' \rangle} = 1 is not satisfied, as shown in the attached image.

[attached image: attention map]

Despite this, I still received a 100% score from the grader. I’ve reviewed my code thoroughly and everything seems correct.

Has anyone else experienced a similar problem? Any insights or suggestions would be greatly appreciated.

Thanks!


Hi @toni_1 !

Is it possible that you have output the attention weights from different queries? Maybe you can try to print the attention weight values from several different queries and check again.

Also, one needs to understand here that the alpha value is the probability output of the softmax, and not a confirmation to equation attention weight values to get correct output. Hope you get this part.

@XinghaoZong I’ve tried other queries, but I’m encountering the same issue.

Looking at Figure 8: Full Attention Map in the notebook, it’s clear that each row should sum to 1. It seems like the softmax function applied to e^{\langle t, t' \rangle} isn’t working as expected.
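For reference, this is the sanity check I'd expect to pass (a minimal numpy sketch with made-up shapes, not the notebook's actual code):

```python
import numpy as np

# Hypothetical energies e<t, t'> for Ty = 10 output steps and Tx = 30 inputs.
energies = np.random.randn(10, 30)

# Softmax over t' (the input axis) yields the alpha weights.
exp_e = np.exp(energies - energies.max(axis=1, keepdims=True))
alphas = exp_e / exp_e.sum(axis=1, keepdims=True)

# Each row (one output step) should sum to 1.
print(np.allclose(alphas.sum(axis=1), 1.0))
```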

I’m not entirely sure what @Deepti_Prasad means by “a confirmation to equation attention weight values to get correct output.”

If anyone has additional insights or suggestions, I would appreciate the help. Thanks for your responses!

Hi Toni, were you able to figure it out? I think I might be facing a similar issue, as my attention map looks exactly the same as yours. Though I'm also seeing weird outputs on the loaded model. Would you mind letting me know what your output is for the test cases? i.e., the cell that starts with EXAMPLES =

hi @toni_1

I somehow missed your response.

Even I have the same plot.

and if you notice, below the plot, it provides this information:

In the date translation application, you will observe that most of the time attention helps predict the year, and doesn’t have much impact on predicting the day or month.

So the attention mechanism allows a network to focus on the most relevant parts of the input when producing a specific part of the output. If you are thinking that the predicted output should always have a probability value (alpha value) of 1, that would be incorrect.

All our graphs are correct, and if you compare the input sequence to the output sequence, the color hue does match. The alpha value gives a probability over the whole input sequence, not just a single attention value.

Hi @Deepti_Prasad, I'm not sure that I followed your explanation. Let's break down the question into a few subquestions:

  1. In the heatmap, are we plotting the alpha values?
  2. Are alpha values supposed to satisfy \sum_{t'} \alpha^{\langle t, t' \rangle} = 1?
  3. That implies that in the plot, I should see each row summing up to 1, right?

My code gives the same results as shown in the OP.

You can read the code for the “plot_attention_map()” function, by opening the “nmt_utils.py” file (using the File->Open menu).

Caution: That code is no picnic to decipher.


Predictions for the “EXAMPLES”.

If I recall correctly, I believe that since the softmax is used on the possible output labels, the columns should sum to 1, not the rows.

The rows represent the entire input sequence for each of the output labels. There is no constraint on the sum of the inputs.
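Either way, which dimension sums to 1 depends only on the axis the softmax is taken over (a minimal numpy sketch with made-up shapes, not the assignment's code):

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.randn(4, 7)  # (output steps, input steps)

# Softmax over the input axis -> each row sums to 1.
rows = softmax(scores, axis=1)
print(np.allclose(rows.sum(axis=1), 1.0))

# Softmax over the output axis -> each column sums to 1.
cols = softmax(scores, axis=0)
print(np.allclose(cols.sum(axis=0), 1.0))
```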


Hi @TMosh @paulinpaloalto , so a few things

  1. I checked the lecture again, and \alpha^{\langle t, t' \rangle} is the amount of attention that y^{\langle t \rangle} should pay to a^{\langle t' \rangle}, so it should indeed be that each row sums up to 1. Let me know if I am misunderstanding something here
  2. I checked the util function, in line 245 of nmt_utils.py, it says attention_map = attention_map / row_max[:, None], which causes the heatmap to always have a maximum value of 1
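In other words, the weights themselves sum to 1 per row, but the plotting helper rescales each row by its maximum before drawing, so the displayed values no longer do. A minimal sketch of that effect (assuming a numpy-style normalization like the quoted line, with made-up shapes):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Proper attention weights: each row sums to 1.
alphas = softmax(np.random.randn(10, 30), axis=1)
assert np.allclose(alphas.sum(axis=1), 1.0)

# The plotting code rescales each row by its maximum,
# so the plotted values peak at 1 but no longer sum to 1.
row_max = alphas.max(axis=1)
plotted = alphas / row_max[:, None]
print(plotted.max(axis=1))   # every row's maximum is exactly 1.0
print(plotted.sum(axis=1))   # generally greater than 1
```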

Hope this helps @toni_1 as well