Clarification on dec_padding_mask

I’m trying to solidify my understanding of the various masks applied for the encoder/decoder blocks in transformers.

So far, I understand that a padding mask is required for both the encoder and decoder blocks so that the attention scores at padded positions are pushed toward -infinity and the softmax effectively ignores them.
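For reference, a minimal sketch of how such a padding mask is typically built (this assumes the course-style TensorFlow convention, where 0 is the padding token id and masked positions are marked with 1 so they can later be pushed toward -infinity before the softmax):

import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 wherever the token id equals the (assumed) padding id 0, else 0.0
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # add broadcast dimensions: (batch_size, 1, 1, seq_len)
    return mask[:, tf.newaxis, tf.newaxis, :]

# Inside scaled dot-product attention the mask is then used roughly as:
#   scaled_attention_logits += (mask * -1e9)   # padded keys get ~-inf before softmax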

The look-ahead mask is applied at training time, when the full target sequence is fed into the decoder block, so the look-ahead mask is needed to block out the later positions and stop the next word from being leaked.

So for example, taking the coding example of the Transformer's call method:
dec_output, attention_weights = self.decoder(output_sentence, enc_output, training, look_ahead_mask, dec_padding_mask)

The look_ahead_mask is supposed to (conceptually) translate:
[23445, 645224, 2310734, 23406, 34072] → [23445, 645224, 2310734, -inf, -inf]
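To make that concrete, a minimal sketch of how such a look-ahead mask is typically generated (note that the -inf is really added to the attention scores for future key positions, not to the token ids themselves; 1 marks a blocked position, as in the padding-mask sketch above):

import tensorflow as tf

def create_look_ahead_mask(size):
    # upper-triangular matrix: row i has 1s at every key position j > i
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(create_look_ahead_mask(5))
# the 3rd row blocks the 4th and 5th key positions,
# which is the effect described in the example above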

But the dec_padding_mask is applied in the 2nd MHA block (inputs: K/V from the encoder and Q from the 1st MHA block). Why is this second padding mask needed?

My current thought is that it's there to apply the EXACT same padding mask as was applied in the encoder, since the K, Q, V entering the 2nd MHA block can take any value after the scaling from the prior Add & Norm layers.

In summary, is dec_padding_mask == enc_padding_mask? (I saw that it was in the programming exercise but want to be sure)

Hi zhiyong9654867,

The multi-head attention layer used in both the encoder and the decoder has the call argument ‘attention_mask’, which indicates which key positions should be attended to; a value of 0 means no attention is paid to that position. For the positions of the original (input) text this holds from start to finish, because the set of padded positions does not change as the sequence moves through the network. So the padding mask that the decoder applies to the encoder outputs should be the same as the padding mask used in the encoder, which is in line with your current thoughts.
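To make the correspondence concrete, here is a minimal sketch of how the three masks are commonly derived from the raw token ids, using the create_padding_mask / create_look_ahead_mask helpers sketched above (where 1 marks a blocked position; the Keras ‘attention_mask’ argument uses the opposite convention, with 0 meaning “do not attend”). Note that both enc_padding_mask and dec_padding_mask are built from the same input sentence:

import tensorflow as tf

def create_masks(inp, tar):
    # encoder self-attention: hide the padded positions of the input sentence
    enc_padding_mask = create_padding_mask(inp)

    # decoder 2nd MHA block (cross-attention): K/V come from the encoder output,
    # so the padded key positions to hide are again those of `inp`
    dec_padding_mask = create_padding_mask(inp)

    # decoder 1st MHA block (self-attention): combine the look-ahead mask with
    # the padding mask of the target sentence
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)
    combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return enc_padding_mask, combined_mask, dec_padding_mask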

This link may be interesting to you.