I noticed that this important question has not been answered yet.
You may well have solved it by yourself already, but I'm writing this for other learners who arrive here by searching the community conversations.
The attention weights in the Encoder and the Decoder are quite important: they show the relationships between source words and target words.
Here is an example from the Transformer implementation available in Google Colab that translates Portuguese to English.
The input sentence in Portuguese is:
este é o primeiro livro que eu fiz
and the expected output is:
this is the first book i’ve ever done
(I’m not sure whether this is really correct, since I don’t know Portuguese at all; both sentences are provided by the Google Colab notebook.)
My model is not fully trained, so its output is slightly different; what I got is:
this is the first book i did n
Now, let’s look at the weights in the 2nd Multi-Head-Attention block in the Decoder. There are 8 attention-weight matrices, one from each of the 8 heads.
Let’s look at one of the eight closely.
“the” and “first” have a fairly strong relationship with “primeiro” in Portuguese, and “book” is related to “livro”.
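If you want to reproduce this kind of plot yourself, one head’s weights form a (target_length x source_length) matrix that can be drawn as a heatmap. Here is a minimal sketch, assuming you already have that matrix for one sentence pair together with the tokenized Portuguese and English words (all variable names here are just for illustration, not from the notebook):

```python
import matplotlib.pyplot as plt

def plot_attention_head(attention, source_tokens, target_tokens, ax=None):
    """Draw one head's attention weights as a heatmap.

    `attention` is assumed to be a 2-D array of shape
    (len(target_tokens), len(source_tokens)); brighter cells mean
    larger attention weights.
    """
    ax = ax or plt.gca()
    ax.matshow(attention)
    ax.set_xticks(range(len(source_tokens)))
    ax.set_yticks(range(len(target_tokens)))
    ax.set_xticklabels(source_tokens, rotation=90)
    ax.set_yticklabels(target_tokens)

# Hypothetical usage: `attn` holds the weights of all 8 heads for one
# sentence, with shape (num_heads, target_len, source_len).
# plot_attention_head(attn[0], pt_tokens, en_tokens)
# plt.show()
```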
If we instead extract the weights from the first Multi-Head-Attention block, since it is “self-attention”, they show the relationships among the English (target) words.
The interesting thing is that the upper triangular part of the matrix is completely dark. This part is masked by the “look-ahead mask”, which prevents the decoder from looking at future words.
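For reference, the look-ahead mask is just an upper-triangular matrix of ones: masked positions get a very large negative value added to their attention logits before the softmax, so their weights come out close to zero, and that region of the plot stays dark. Below is a minimal sketch of such a mask in the same spirit as the Colab notebook (the function name and exact interface in the notebook may differ):

```python
import tensorflow as tf

def look_ahead_mask(size):
    # 1 marks positions a target word is NOT allowed to attend to,
    # i.e. everything strictly above the diagonal (the "future" words).
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(look_ahead_mask(4).numpy())
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]
```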
Since it is a short sentence, it does not carry much information, but you can see that “book” has some relation with “first”, and “first” has some relation with “the”.
Hope this clarifies things a bit.