Course 5 Week 4 Assignment: Why are attention weights returned in DecoderLayer?

This is achieved through the return_attention_scores=True argument. I would like to understand the rationale for outputting these scores. Normally, we don't explicitly pull the weights out of a middle layer of a deep architecture. Is this solely so that the unit tests can assert on the weights? In the rest of the code, those scores don't seem to be used for anything.
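For reference, this is roughly how the flag behaves in tf.keras.layers.MultiHeadAttention. This is just a minimal sketch; the shapes and variable names are illustrative, not the assignment's:

```python
import tensorflow as tf

# Minimal sketch of the flag's behavior (shapes/names are illustrative).
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

query = tf.random.uniform((1, 10, 512))  # (batch, target_len, d_model)
value = tf.random.uniform((1, 12, 512))  # (batch, source_len, d_model)

# With return_attention_scores=True the layer returns the usual output
# plus the per-head weights, shape (batch, num_heads, target_len, source_len).
output, attn_scores = mha(query, value, return_attention_scores=True)
print(output.shape)       # (1, 10, 512)
print(attn_scores.shape)  # (1, 8, 10, 12)
```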

Another scenario I have seen is if someone wants to introduce a regularization loss on the attention weights, e.g., to keep them from becoming too peaked. However, I am not sure whether this is mentioned in the video lectures or in the "Attention Is All You Need" paper.
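To be concrete about that scenario: a hypothetical way to do it in Keras would be to feed the returned scores into add_loss, roughly as sketched below. To be clear, this is not something the assignment or the paper does; it only illustrates one possible use of the returned scores.

```python
import tensorflow as tf

class DecoderLayerWithAttnPenalty(tf.keras.layers.Layer):
    """Hypothetical sketch only: NOT what the assignment or the paper does.
    It just shows how returned attention scores could feed a
    regularization term via add_loss."""

    def __init__(self, d_model=512, num_heads=8, penalty=1e-4):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)
        self.penalty = penalty

    def call(self, x, enc_output):
        out, scores = self.mha(query=x, value=enc_output,
                               return_attention_scores=True)
        # Penalize very peaked attention rows: the sum of squares of a
        # distribution is smallest when it is uniform over source positions.
        self.add_loss(self.penalty *
                      tf.reduce_mean(tf.reduce_sum(tf.square(scores), axis=-1)))
        return out
```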

I see that this important question has not been answered yet.

I guess you may have already worked this out yourself, but I'm writing this for other learners who arrive here by searching the community discussions.

The attention weights in the Encoder and the Decoder are quite important: they show the relationships between source words and target words.

Here is an example from a Transformer implementation available on Google Colab that translates Portuguese to English.

The input sentence in Portuguese is:

este é o primeiro livro que eu fiz

And the expected output is:

this is the first book i’ve ever done

(I'm not sure this is really correct, since I don't know Portuguese at all… :sweat_smile: Both sentences are provided by the Google Colab notebook.)

My model is not fully trained, so the output is slightly different; what I got is:

this is the first book i did n

Now let's look at the weights in the 2nd Multi-Head Attention block in the Decoder. There are 8 weight matrices, one per head of the Multi-Head Attention layer.

Let's look at one of the eight closely.

"the" and "first" have a fairly strong relationship with "primeiro" in Portuguese, and "book" is related to "livro".
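If you want to reproduce this kind of plot yourself, here is a rough sketch. The token lists are the ones from this example, and the attention array is assumed to have shape (num_heads, target_len, source_len); how you obtain it depends on your notebook version, so treat the details as illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Token lists from the example above; in practice you would get them
# from your tokenizer, so treat these as illustrative.
src_tokens = ['este', 'é', 'o', 'primeiro', 'livro', 'que', 'eu', 'fiz']
out_tokens = ['this', 'is', 'the', 'first', 'book', 'i', 'did', 'n']

def plot_head(attn, head=0):
    """attn: array of shape (num_heads, target_len, source_len)."""
    weights = attn[head]
    fig, ax = plt.subplots()
    ax.matshow(weights, cmap='viridis')          # brighter = larger weight
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=90)
    ax.set_yticks(range(len(out_tokens)))
    ax.set_yticklabels(out_tokens)
    ax.set_xlabel('source (Portuguese)')
    ax.set_ylabel('prediction (English)')
    plt.show()

# Dummy weights just to show the call; replace with the real scores.
plot_head(np.random.rand(8, len(out_tokens), len(src_tokens)), head=0)
```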

If we extract the weights from the first Multi-Head Attention block, which is "self-attention", they show the relationships among the English (target) words.

An interesting thing is that the upper-triangular part of the matrix is really dark. This is because it is masked by the "look-ahead mask", which prevents each position from attending to future words.
As it is a short sentence, there is not much information, but you can see that "book" has some relation with "first", and "first" has some relation with "the".
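The mask itself is just a strict upper-triangular matrix of ones; in scaled dot-product attention, a large negative value is added to the logits at those positions, so the softmax drives the corresponding weights to nearly zero, which is the dark area in the plot. A minimal sketch, along the lines of the helper used in the assignment and the Colab tutorial:

```python
import tensorflow as tf

def create_look_ahead_mask(size):
    # 1s mark the positions to hide: each word may attend only to itself
    # and to earlier words, never to future ones.
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

mask = create_look_ahead_mask(4)
# The strict upper triangle is 1 (masked); these logits get a large
# negative value added before the softmax, so their weights become ~0.
print(mask)
```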

Hope this clarifies things a bit.