Attention Weights Issue in C5 W3 Neural Machine Translation Assignment (Section 3.1)

In the optional part of the C5 W3 Neural Machine Translation assignment, specifically in section 3.1 - Getting the Attention Weights From the Network, I’m encountering an issue where the outputted attention values are completely incorrect. The condition \sum_{t'} \alpha^{\langle t, t' \rangle} = 1 is not satisfied, as shown in the attached image.

[attached image: attention map]

Despite this, I still received a 100% score from the grader. I’ve reviewed my code thoroughly and everything seems correct.

Has anyone else experienced a similar problem? Any insights or suggestions would be greatly appreciated.

Thanks!


Hi @toni_1 !

Is it possible that you have output the attention weights from different queries? Maybe you can try to print the attention weight values from several different queries and check again.

Also, one needs to understand here that the alpha value is the probability output of the softmax, and not a confirmation to equation attention weight values to get correct output. Hope you get this part.

@XinghaoZong I’ve tried other queries, but I’m encountering the same issue.

Looking at Figure 8: Full Attention Map in the notebook, it’s clear that each row should sum to 1. It seems like the softmax function applied to e^{\langle t, t' \rangle} isn’t working as expected.
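For reference, this is the sanity check I'd expect to pass (a minimal numpy sketch with made-up shapes, not the notebook's actual code):

```python
import numpy as np

# Hypothetical energies e<t, t'> for Ty = 10 output steps and Tx = 30 inputs.
energies = np.random.randn(10, 30)

# Softmax over t' (the input axis) yields the alpha weights.
exp_e = np.exp(energies - energies.max(axis=1, keepdims=True))
alphas = exp_e / exp_e.sum(axis=1, keepdims=True)

# Each row (one output step) should sum to 1.
print(np.allclose(alphas.sum(axis=1), 1.0))
```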

I’m not entirely sure what @Deepti_Prasad means by “a confirmation to equation attention weight values to get correct output.”

If anyone has additional insights or suggestions, I would appreciate the help. Thanks for your responses!

Hi Toni, were you able to figure it out? I think I might be facing a similar issue, as my attention map looks exactly the same as yours. Though I'm also seeing weird outputs on the loaded model. Would you mind letting me know what your output is for the test cases? i.e., the cell that starts with EXAMPLES =

hi @toni_1

I somehow missed your response.

Even I have the same plot.

and if you notice, below the plot, it provides this information:

In the date translation application, you will observe that most of the time attention helps predict the year, and doesn’t have much impact on predicting the day or month.

So the attention mechanism allows a network to focus on the most relevant parts of the input when producing a specific part of the output. If you are thinking that the predicted output should always have a probability value (alpha value) of 1, that would be incorrect.

All our graphs are correct, and if you compare the input sequence to the output sequence, the color hue does match. The alpha value gives a probability over the whole input sequence, not just a single attention value.

Hi @Deepti_Prasad, I'm not sure that I followed your explanation. Let's break down the question into a few subquestions:

  1. In the heatmap, are we plotting the alpha values?
  2. Are alpha values supposed to satisfy \sum_{t'} \alpha^{\langle t, t' \rangle} = 1?
  3. That implies that in the plot, I should see each row summing up to 1, right?

My code gives the same results as shown in the OP.

You can read the code for the “plot_attention_map()” function, by opening the “nmt_utils.py” file (using the File->Open menu).

Caution: That code is no picnic to decipher.


Predictions for the “EXAMPLES”.

If I recall correctly, I believe that since the softmax is used on the possible output labels, the columns should sum to 1, not the rows.

The rows represent the entire input sequence for each of the output labels. There is no constraint on the sum of the inputs.
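Either way, which dimension sums to 1 depends only on the axis the softmax is taken over (a minimal numpy sketch with made-up shapes, not the assignment's code):

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.randn(4, 7)  # (output steps, input steps)

# Softmax over the input axis -> each row sums to 1.
rows = softmax(scores, axis=1)
print(np.allclose(rows.sum(axis=1), 1.0))

# Softmax over the output axis -> each column sums to 1.
cols = softmax(scores, axis=0)
print(np.allclose(cols.sum(axis=0), 1.0))
```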


Hi @TMosh @paulinpaloalto , so a few things

  1. I checked the lecture again, and \alpha^{\langle t, t' \rangle} is the amount of attention that y^{\langle t \rangle} should pay to a^{\langle t' \rangle}, so it should indeed be that each row sums up to 1. Let me know if I am misunderstanding something here
  2. I checked the util function, in line 245 of nmt_utils.py, it says attention_map = attention_map / row_max[:, None], which causes the heatmap to always have a maximum value of 1
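In other words, the weights themselves sum to 1 per row, but the plotting helper rescales each row by its maximum before drawing, so the displayed values no longer do. A minimal sketch of that effect (assuming a numpy-style normalization like the quoted line, with made-up shapes):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Proper attention weights: each row sums to 1.
alphas = softmax(np.random.randn(10, 30), axis=1)
assert np.allclose(alphas.sum(axis=1), 1.0)

# The plotting code rescales each row by its maximum,
# so the plotted values peak at 1 but no longer sum to 1.
row_max = alphas.max(axis=1)
plotted = alphas / row_max[:, None]
print(plotted.max(axis=1))   # every row's maximum is exactly 1.0
print(plotted.sum(axis=1))   # generally greater than 1
```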

Hope this helps @toni_1 as well