I’m trying to solidify my understanding of the various masks applied in the encoder/decoder blocks of Transformers.
So far, I grasp that a padding mask is required for both the encoder and decoder blocks, so that the attention scores at padded positions are pushed to -infinity and the softmax effectively ignores them.
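For concreteness, here is a minimal sketch of how I picture the padding mask being built and used (assuming TensorFlow and a pad token id of 0; the function name is just my own label, not necessarily what the exercise calls it):

```python
import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 where the token is padding (assumed pad id == 0), 0.0 elsewhere
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # add broadcast dimensions so it lines up with attention scores of shape
    # (batch_size, num_heads, seq_len_q, seq_len_k)
    return mask[:, tf.newaxis, tf.newaxis, :]

# inside scaled dot-product attention the mask is used roughly like:
#   scaled_attention_logits += (mask * -1e9)
# so padded key positions end up near -infinity and softmax gives them ~0 weight
```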
The look-ahead mask is applied at training time, when the full target sequence is fed into the decoder block at once; it is needed to block out the future positions so the next word isn’t leaked.
So, for example, taking the coding example of the Transformer’s call method:
```python
dec_output, attention_weights = self.decoder(output_sentence, enc_output, training, look_ahead_mask, dec_padding_mask)
```
The look_ahead_mask is supposed to, in effect, translate:
[23445, 645224, 2310734, 23406, 34072] → [23445, 645224, 2310734, -inf, -inf]
(for the position that is only allowed to see the first three tokens).
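Here is a sketch of how I picture the look-ahead mask itself: an upper-triangular matrix of 1s that the attention code then turns into -inf scores, so each position can only attend to itself and earlier positions (again, the function name is mine):

```python
import tensorflow as tf

def create_look_ahead_mask(size):
    # 1s strictly above the diagonal: position i is blocked from
    # attending to any position j > i
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

# create_look_ahead_mask(4) gives:
# [[0., 1., 1., 1.],
#  [0., 0., 1., 1.],
#  [0., 0., 0., 1.],
#  [0., 0., 0., 0.]]
```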
But the dec_padding_mask is applied in the 2nd MHA block (cross-attention: K/V come from the encoder output and Q comes from the 1st MHA block). Why is this second padding mask needed?
My current thought is that it’s there to apply the exact same padding mask as was applied in the encoder, since the K, Q, V entering the 2nd MHA block can take any values after the scaling from the prior Add & Norm layers.
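In code, my understanding would amount to both padding masks being built from the same source (encoder-input) sentence, something like the sketch below (reusing create_padding_mask from the first snippet; the toy input and variable names are just placeholders of mine):

```python
# toy batch of one already-tokenized, padded encoder-input sentence (pad id = 0)
input_sentence = tf.constant([[12, 345, 67, 0, 0]])

# both masks come from the *encoder input* sentence, because in the 2nd MHA block
# the keys/values are the encoder output, whose padded positions line up with
# the padded tokens of that input sentence
enc_padding_mask = create_padding_mask(input_sentence)  # used in encoder self-attention
dec_padding_mask = create_padding_mask(input_sentence)  # used in decoder cross-attention
```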
In summary, is dec_padding_mask == enc_padding_mask? (I saw that it was in the programming exercise, but I want to be sure.)