I noticed that this important question has not been answered yet.
You may well have solved it by yourself already, but I'm writing this for other learners who arrive here by searching the community conversations.
The attention weights in the Encoder and the Decoder are quite important: they show the relationships between source words and target words.
Here is an example from the Transformer implementation available in Google Colab that translates Portuguese to English.
The input sentence in Portuguese is:
este é o primeiro livro que eu fiz
and the expected output is:
this is the first book i’ve ever done
(I’m not sure whether this is really correct, since I don’t know Portuguese at all; both sentences are provided by the Google Colab notebook.)
My model is not fully trained, so its output is slightly different; what I got is:
this is the first book i did n
Now, let’s look at the weights in the 2nd Multi-Head-Attention block in the Decoder. There are 8 attention-weight matrices, one from each of the 8 heads.
Let’s look at one of the eight closely.
“the” and “first” have a fairly strong relationship with “primeiro” in Portuguese, and “book” is related to “livro”.
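If you want to reproduce this kind of plot yourself, one head’s weights form a (target_length x source_length) matrix that can be drawn as a heatmap. Here is a minimal sketch, assuming you already have that matrix for one sentence pair together with the tokenized Portuguese and English words (all variable names here are just for illustration, not from the notebook):

```python
import matplotlib.pyplot as plt

def plot_attention_head(attention, source_tokens, target_tokens, ax=None):
    """Draw one head's attention weights as a heatmap.

    `attention` is assumed to be a 2-D array of shape
    (len(target_tokens), len(source_tokens)); brighter cells mean
    larger attention weights.
    """
    ax = ax or plt.gca()
    ax.matshow(attention)
    ax.set_xticks(range(len(source_tokens)))
    ax.set_yticks(range(len(target_tokens)))
    ax.set_xticklabels(source_tokens, rotation=90)
    ax.set_yticklabels(target_tokens)

# Hypothetical usage: `attn` holds the weights of all 8 heads for one
# sentence, with shape (num_heads, target_len, source_len).
# plot_attention_head(attn[0], pt_tokens, en_tokens)
# plt.show()
```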
If we instead extract the weights from the first Multi-Head-Attention block, since it is “self-attention”, they show the relationships among the English (target) words.
The interesting thing is that the upper triangular part of the matrix is completely dark. This part is masked by the “look-ahead mask”, which prevents the decoder from looking at future words.
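For reference, the look-ahead mask is just an upper-triangular matrix of ones: masked positions get a very large negative value added to their attention logits before the softmax, so their weights come out close to zero, and that region of the plot stays dark. Below is a minimal sketch of such a mask in the same spirit as the Colab notebook (the function name and exact interface in the notebook may differ):

```python
import tensorflow as tf

def look_ahead_mask(size):
    # 1 marks positions a target word is NOT allowed to attend to,
    # i.e. everything strictly above the diagonal (the "future" words).
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(look_ahead_mask(4).numpy())
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]
```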
Since it is a short sentence, it does not carry much information, but you can see that “book” has some relation with “first”, and “first” has some relation with “the”.
Hope this clarifies things a bit.