Dec_padding_mask

Hi @blackdragon

The name dec_padding_mask might have misled you. The dec_padding_mask is used in the second multi-head attention block (cross-attention):

It is used to let the decoder know which encoder inputs were padding.
For example, if the English sentence is “I love learning <pad> <pad> <pad> <pad> <pad>”, the encoder encodes this sequence as 8 tokens, producing an output of shape (1, 8, n_units) (batch_size, seq_length, feature_size).
So, this mask informs the decoder that the tokens from position 4 onward are padding tokens, and it should pay no attention to them when generating the German sentence.
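Here is a minimal sketch of how such a mask is typically built and applied (the token ids and the pad id of 0 are made up for illustration; the helper follows the create_padding_mask style from the TensorFlow Transformer tutorial):

```python
import tensorflow as tf

# Hypothetical token ids for "I love learning" followed by 5 <pad> tokens
# (assuming the pad token has id 0). Shape: (batch_size=1, seq_length=8).
enc_input = tf.constant([[7, 42, 13, 0, 0, 0, 0, 0]])

def create_padding_mask(seq):
    # 1.0 where the token is padding, 0.0 elsewhere.
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # Add extra dims so the mask broadcasts over (batch, heads, q_len, k_len).
    return mask[:, tf.newaxis, tf.newaxis, :]  # shape (1, 1, 1, 8)

dec_padding_mask = create_padding_mask(enc_input)

# Inside cross-attention, the mask pushes the padded key positions toward
# -inf before the softmax, so they receive (near-)zero attention weight:
scores = tf.random.normal((1, 1, 1, 8))   # stand-in attention logits
scores += dec_padding_mask * -1e9
weights = tf.nn.softmax(scores, axis=-1)
print(weights)  # positions 4-8 get essentially zero weight
```

So the decoder queries still attend over all 8 encoder outputs, but the softmax gives the padded positions effectively zero weight.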

Cheers