This can be traced by looking at the test Transformer_test, where dec_padding_mask is created from the output sentence:
sentence_lang_a = np.array([[2, 1, 4, 3, 0]])  # input sentence
sentence_lang_b = np.array([[3, 2, 1, 0, 0]])  # output sentence
enc_padding_mask = create_padding_mask(sentence_lang_a)
dec_padding_mask = create_padding_mask(sentence_lang_b)  # built from the output sentence
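For reference, here is a minimal sketch of what create_padding_mask typically looks like (the exact 0/1 convention and number of broadcast axes may differ in the assignment; the only thing that matters for the argument below is the last axis):

import numpy as np
import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 for real tokens, 0.0 for padding (token id 0) -- the convention
    # expected by attention_mask in tf.keras.layers.MultiHeadAttention
    mask = 1.0 - tf.cast(tf.math.equal(seq, 0), tf.float32)
    # add a broadcast axis over query positions: (batch_size, 1, seq_len)
    return mask[:, tf.newaxis, :]

print(create_padding_mask(np.array([[3, 2, 1, 0, 0]])).shape)  # (1, 1, 5)

Whatever the exact convention, the mask's last axis has the length of the sequence it was built from, so dec_padding_mask here ends with dimension 5 = len(sentence_lang_b).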
It is then passed down until it reaches the decoder’s cross-attention block (mha2):
# in the Transformer:
dec_output, attention_weights = self.decoder(output_sentence, enc_output, training,
                                             look_ahead_mask, dec_padding_mask)

# in the Decoder:
x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                       look_ahead_mask, padding_mask)

# in the DecoderLayer (cross-attention):
mult_attn_out2, attn_weights_block2 = self.mha2(
    query=Q1, key=enc_output, value=enc_output, attention_mask=padding_mask,
    training=training, return_attention_scores=True)
In mha2 the keys come from the encoder output, so their sequence axis has the input sequence length, and the mask therefore has to match that length as well.
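A quick shape check makes this concrete (a self-contained sketch with made-up sizes, not the assignment code): in cross-attention the scores are matmul(Q, K, transpose_b=True), so whatever mask is applied has to broadcast against the key axis.

import tensorflow as tf

batch, target_seq_len, input_seq_len, d_model = 1, 4, 5, 8  # hypothetical sizes

q = tf.random.normal((batch, target_seq_len, d_model))  # queries come from the decoder (output sentence)
k = tf.random.normal((batch, input_seq_len, d_model))   # keys come from enc_output (input sentence)

scores = tf.matmul(q, k, transpose_b=True)
print(scores.shape)  # (1, 4, 5) = (..., seq_len_q, seq_len_k); the last axis is input_seq_len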
Looking at scaled_dot_product_attention(q, k, v, mask), its docstring says:

    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k).

So the mask's last dimension must be seq_len_k, and since k comes from the input sequence, that dimension is the input sentence length. In other words, what is needed here is the padding mask for the input sequence, not the output one.
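The mix-up goes unnoticed in the test because both sample sentences happen to have length 5, so the shapes line up by coincidence. With unequal lengths, a padding mask built from the output sentence would not even broadcast against the cross-attention scores; a sketch using an additive mask of the (1 - mask) * -1e9 form:

import tensorflow as tf

scores = tf.random.normal((1, 4, 5))   # (batch, seq_len_q=4, seq_len_k=5) cross-attention logits
mask_from_input = tf.ones((1, 1, 5))   # built from the input sentence: last axis matches seq_len_k
mask_from_output = tf.ones((1, 1, 4))  # built from the output sentence: wrong last axis

ok = scores + (1.0 - mask_from_input) * -1e9      # broadcasts fine
# bad = scores + (1.0 - mask_from_output) * -1e9  # InvalidArgumentError: incompatible shapes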
Q: Am I missing something, or is there indeed a mistake here (i.e., should dec_padding_mask be built from the input sentence, sentence_lang_a, instead)?