In the optional part of the C5 W3 Neural Machine Translation assignment, specifically in section 3.1 - Getting the Attention Weights From the Network, I’m encountering an issue where the outputted attention values are completely incorrect. The condition \sum_{t'} \alpha^{\langle t, t' \rangle} = 1 is not satisfied, as shown in the attached image.
Despite this, I still received a 100% score from the grader. I’ve reviewed my code thoroughly and everything seems correct.
Has anyone else experienced a similar problem? Any insights or suggestions would be greatly appreciated.
Is it possible that you have output the attention weights from different queries? Maybe you can try to print the attention weight values from several different queries and check again.
Also, one thing to understand here is that the alpha value is the probability output of the softmax, not a confirmation that the attention weight values match the equation for the correct output. Hope you get this part.
@XinghaoZong I’ve tried other queries, but I’m encountering the same issue.
Looking at Figure 8: Full Attention Map in the notebook, it’s clear that each row should sum to 1. It seems like the softmax function applied to e^{\langle t, t' \rangle} isn’t working as expected.
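For reference, here is a minimal sketch (plain NumPy, independent of the assignment code, with made-up energies) of the property I expect: applying softmax to the energies e^{\langle t, t' \rangle} across the input positions should make every row of alphas sum to 1.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

# Hypothetical energies e^{<t, t'>}: Ty output steps x Tx input steps.
Ty, Tx = 10, 30
energies = np.random.randn(Ty, Tx)

alphas = softmax(energies, axis=1)  # softmax over the input positions t'
print(alphas.sum(axis=1))           # every row should sum to (numerically) 1.0
```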
I’m not entirely sure what @Deepti_Prasad means by the alpha value “not being a confirmation that the attention weight values match the equation.”
If anyone has additional insights or suggestions, I would appreciate the help. Thanks for your responses!
Hi Toni, were you able to figure it out? I think I might be facing a similar issue, as my attention map looks exactly the same as yours, though I’m also seeing weird outputs from the loaded model. Would you mind letting me know what your output is for the test cases, i.e., the cell that starts with EXAMPLES =?
And if you notice, below the plot the notebook provides this information:
In the date translation application, you will observe that most of the time attention helps predict the year, and doesn’t have much impact on predicting the day or month.
So the attention mechanism allows a network to focus on the most relevant parts of the input when producing a specific part of the output. If you are expecting the predicted output to always have a probability (alpha) value of 1, that would be incorrect.
All our graphs are correct, and if you compare the input sequence to the output sequence, the color hue does match. The alpha value gives the probability across the whole input sequence, not just a single attention value.
I checked the lecture again, and \alpha^{\langle t, t' \rangle} is the amount of attention that y^{\langle t \rangle} should pay to a^{\langle t' \rangle}, and it should indeed be the case that each row sums to 1. Let me know if I am misunderstanding something here.
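As I understand the lecture, the weights are defined as a softmax over the energies, which is exactly what forces each row to sum to 1:

\alpha^{\langle t, t' \rangle} = \frac{\exp\left(e^{\langle t, t' \rangle}\right)}{\sum_{t''=1}^{T_x} \exp\left(e^{\langle t, t'' \rangle}\right)}, \qquad \sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle} = 1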
I checked the util function: around line 245 of nmt_utils.py it says attention_map = attention_map / row_max[:, None], which divides each row by its maximum. So every row of the plotted heatmap is rescaled to have a maximum value of 1, and the rows no longer sum to 1, even if the underlying softmax output is correct.
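To illustrate the effect (a minimal sketch with made-up numbers, not the actual assignment data): if attention_map starts out as proper softmax rows, the division by row_max rescales each row so its maximum is 1 and the row sums drift away from 1; dividing by the row sums instead restores the property.

```python
import numpy as np

# Hypothetical raw softmax attention weights: each row sums to 1.
raw = np.array([[0.10, 0.70, 0.20],
                [0.25, 0.25, 0.50]])
print(raw.sum(axis=1))                      # [1. 1.]

# What the line in nmt_utils.py does: scale each row by its maximum.
row_max = raw.max(axis=1)
plotted = raw / row_max[:, None]
print(plotted.max(axis=1))                  # [1. 1.]  <- every row now peaks at 1
print(plotted.sum(axis=1))                  # rows no longer sum to 1

# Renormalizing by the row sums recovers rows that sum to 1 again.
recovered = plotted / plotted.sum(axis=1, keepdims=True)
print(recovered.sum(axis=1))                # [1. 1.]
```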