I’m trying to solidify my understanding of the various masks applied in the encoder/decoder blocks of Transformers.
So far, I grasp that a padding mask is required for both the encoder and decoder blocks, so that the attention scores at padded positions are pushed to -infinity and the softmax effectively ignores them.
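For concreteness, here is a minimal sketch of how I picture the padding mask being built and used (assuming TensorFlow and a pad token id of 0; the function name is just my own label, not necessarily what the exercise calls it):

```python
import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 where the token is padding (assumed pad id == 0), 0.0 elsewhere
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # add broadcast dimensions so it lines up with attention scores of shape
    # (batch_size, num_heads, seq_len_q, seq_len_k)
    return mask[:, tf.newaxis, tf.newaxis, :]

# inside scaled dot-product attention the mask is used roughly like:
#   scaled_attention_logits += (mask * -1e9)
# so padded key positions end up near -infinity and softmax gives them ~0 weight
```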
The look-ahead mask is applied at training time, when the full target sequence is fed into the decoder block at once; it is needed to block out the future positions so the next word isn’t leaked.
So, for example, taking the coding example of the Transformer’s call method:
```python
dec_output, attention_weights = self.decoder(output_sentence, enc_output, training, look_ahead_mask, dec_padding_mask)
```
The look_ahead_mask is supposed to, in effect, translate:
[23445, 645224, 2310734, 23406, 34072] → [23445, 645224, 2310734, -inf, -inf]
(for the position that is only allowed to see the first three tokens).
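Here is a sketch of how I picture the look-ahead mask itself: an upper-triangular matrix of 1s that the attention code then turns into -inf scores, so each position can only attend to itself and earlier positions (again, the function name is mine):

```python
import tensorflow as tf

def create_look_ahead_mask(size):
    # 1s strictly above the diagonal: position i is blocked from
    # attending to any position j > i
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

# create_look_ahead_mask(4) gives:
# [[0., 1., 1., 1.],
#  [0., 0., 1., 1.],
#  [0., 0., 0., 1.],
#  [0., 0., 0., 0.]]
```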
But the dec_padding_mask is applied in the 2nd MHA block (cross-attention: K/V come from the encoder output and Q comes from the 1st MHA block). Why is this second padding mask needed?
My current thought is that it’s there to apply the exact same padding mask as was applied in the encoder, since the K, Q, V entering the 2nd MHA block can take any values after the scaling from the prior Add & Norm layers.
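In code, my understanding would amount to both padding masks being built from the same source (encoder-input) sentence, something like the sketch below (reusing create_padding_mask from the first snippet; the toy input and variable names are just placeholders of mine):

```python
# toy batch of one already-tokenized, padded encoder-input sentence (pad id = 0)
input_sentence = tf.constant([[12, 345, 67, 0, 0]])

# both masks come from the *encoder input* sentence, because in the 2nd MHA block
# the keys/values are the encoder output, whose padded positions line up with
# the padded tokens of that input sentence
enc_padding_mask = create_padding_mask(input_sentence)  # used in encoder self-attention
dec_padding_mask = create_padding_mask(input_sentence)  # used in decoder cross-attention
```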
In summary, is dec_padding_mask == enc_padding_mask? (I saw that it was in the programming exercise, but I want to be sure.)