C5 W4 - Mistake in Transformer dec_padding_mask

This can be traced by looking at the test Transformer_test.

dec_padding_mask is created from the output sequence:

    sentence_lang_a = np.array([[2, 1, 4, 3, 0]])
    sentence_lang_b = np.array([[3, 2, 1, 0, 0]])

    enc_padding_mask = create_padding_mask(sentence_lang_a)
    dec_padding_mask = create_padding_mask(sentence_lang_b)
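
For reference, with a create_padding_mask along the lines of the TensorFlow tutorial version (a sketch; the assignment's exact shape and sign conventions may differ), each mask's last dimension equals the length of the sentence it was built from:

    import numpy as np
    import tensorflow as tf

    def create_padding_mask(seq):
        # 1.0 where the token id is 0 (padding), 0.0 elsewhere,
        # reshaped so it can broadcast against (..., seq_len_q, seq_len_k)
        mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
        return mask[:, tf.newaxis, tf.newaxis, :]

    sentence_lang_a = np.array([[2, 1, 4, 3, 0]])
    sentence_lang_b = np.array([[3, 2, 1, 0, 0]])

    print(create_padding_mask(sentence_lang_a).shape)  # (1, 1, 1, 5) -- last dim = input length
    print(create_padding_mask(sentence_lang_b).shape)  # (1, 1, 1, 5) -- last dim = output length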

It’s then used in the decoder’s cross-attention block:

    # dec_padding_mask is passed into the decoder
    dec_output, attention_weights = self.decoder(output_sentence, enc_output, training,
                                                 look_ahead_mask, dec_padding_mask)

    # ...which forwards it to each decoder layer as padding_mask
    x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                           look_ahead_mask, padding_mask)

    # ...where it is used as the attention mask of the cross-attention block (mha2)
    mult_attn_out2, attn_weights_block2 = self.mha2(
        query=Q1, key=enc_output, value=enc_output, attention_mask=padding_mask,
        training=training, return_attention_scores=True)

In mha2, the keys come from the encoder output, so they have the input sequence length; the mask therefore needs to match that size as well.

Looking at scaled_dot_product_attention(q, k, v, mask), the docstring says:

    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k).

The mask’s last dimension must be seq_len_k. Since k comes from the input sequence, that dimension equals the input sentence length, so this should be a padding mask for the input sequence, not the output sequence.
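
To make the shape argument concrete, here is a minimal sketch of the cross-attention score computation (shapes and names are illustrative, not the assignment's):

    import tensorflow as tf

    batch, num_heads, depth = 1, 2, 4
    out_len, in_len = 6, 5   # decoder (query) length vs. encoder (key) length

    q = tf.random.normal((batch, num_heads, out_len, depth))  # queries from the decoder
    k = tf.random.normal((batch, num_heads, in_len, depth))   # keys from the encoder output

    scores = tf.matmul(q, k, transpose_b=True)
    print(scores.shape)  # (1, 2, 6, 5) == (..., seq_len_q, seq_len_k)

    # The mask is added to `scores`, so its last dimension has to broadcast against
    # seq_len_k = in_len, i.e. it must be a padding mask built from the input sentence.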

Q: Am I missing something or is there indeed a mistake?

Easy proof: I copied the test and added one more token to the output sequence; it now results in an error.
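
The shape mismatch can also be reproduced in isolation; a minimal sketch, again assuming a tutorial-style create_padding_mask rather than the assignment's exact implementation:

    import numpy as np
    import tensorflow as tf

    def create_padding_mask(seq):
        mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
        return mask[:, tf.newaxis, tf.newaxis, :]

    input_sentence  = np.array([[2, 1, 4, 3, 0]])     # length 5
    output_sentence = np.array([[3, 2, 1, 0, 0, 0]])  # length 6: one extra token

    # cross-attention scores: queries from the decoder (len 6), keys from the encoder (len 5)
    scores = tf.random.normal((1, 1, 6, 5))           # (..., seq_len_q, seq_len_k)

    scores + create_padding_mask(input_sentence) * -1e9    # fine: (1, 1, 1, 5) broadcasts to (1, 1, 6, 5)
    # scores + create_padding_mask(output_sentence) * -1e9 # fails: (1, 1, 1, 6) cannot broadcast to (1, 1, 6, 5)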

A: This is a documentation error in the notebook text. It has already been reported and will hopefully be fixed soon.
