For exercise 5 (next_word function), why can’t we pass the output to the create_padding_mask function when defining dec_padding_mask? Isn’t “output” here the decoder input, which is what is supposed to be sent?

Hi @blackdragon

The name dec_padding_mask might have misled you. The dec_padding_mask is used in the second Multi-Head attention (Cross-Attention):

It is used to let the decoder know which encoder inputs were padding.
For example, if the English sentence is “I love learning <pad> <pad> <pad> <pad> <pad> <pad>”, the Encoder output for this 9-token sequence has shape (1, 9, n_units), i.e. (batch_size, seq_length, feature_size).
So this mask informs the decoder that the tokens from position 4 onward are padding tokens, and that it should not pay any attention to them when generating the German sentence.
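
To make the shape argument concrete, here is a rough sketch in plain Python (not the assignment's code). In cross-attention the keys and values come from the encoder, so the mask needs one entry per encoder position; a mask built from the decoder "output" would have the wrong length and would hide the wrong positions. The helper names `padding_mask` and `masked_softmax`, the token ids, and the convention "pad id is 0, mask is 1 for real tokens" are all assumptions made for illustration.

```python
import math

PAD_ID = 0  # assumption: padding is encoded as token id 0


def padding_mask(token_ids):
    """Hypothetical helper: 1.0 for real tokens, 0.0 for padding."""
    return [0.0 if t == PAD_ID else 1.0 for t in token_ids]


def masked_softmax(scores, mask):
    """Softmax over scores that gives masked-out positions zero weight."""
    # Push masked positions to a very large negative value so exp() gives 0.
    shifted = [s + (1.0 - m) * -1e9 for s, m in zip(scores, mask)]
    peak = max(shifted)
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up token ids for "I love learning" followed by six <pad> tokens.
encoder_input = [12, 47, 305, 0, 0, 0, 0, 0, 0]
# Made-up decoder output so far (e.g. a start token plus one generated word).
decoder_output = [1, 88]

# The cross-attention mask is built from the ENCODER input.
enc_mask = padding_mask(encoder_input)

# Each decoder position scores every encoder position, so one row of
# attention scores has len(encoder_input) entries, not len(decoder_output).
scores_for_one_decoder_position = [0.5] * len(encoder_input)
weights = masked_softmax(scores_for_one_decoder_position, enc_mask)

print(enc_mask)
print(weights)
```

With these inputs, all of the attention weight lands on the first three encoder positions and the padded positions get zero. Note that `padding_mask(decoder_output)` would have length 2, which cannot line up with the 9 encoder positions being attended to.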