I have a question about Block 1 of DecoderLayer.
The hint suggests that only the look-ahead mask is needed for Block 1, and indeed I passed all the unit tests by passing only the look-ahead mask.
However, I wonder why the padding mask is not needed in the self-attention layer of the decoder. Isn't a padding mask always needed, given that all sequences within a batch are padded to the same length and the padding tokens shouldn't contribute to attention?
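To illustrate what I mean, here is a toy batch (made-up token ids, with 0 as the padding id, which may not match the assignment's setup):

```python
import tensorflow as tf

# Two sequences padded with 0s to a common length of 5. The shapes already
# match, but I would still expect the padded positions to be hidden from
# self-attention by a padding mask.
batch = tf.constant([[12, 45,  7,  0,  0],
                     [ 3, 88, 21, 19,  0]])

# 1 marks a padding position.
print(tf.cast(tf.math.equal(batch, 0), tf.float32))
# [[0. 0. 0. 1. 1.]
#  [0. 0. 0. 0. 1.]]
```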
When I tried the following code, though, a couple of the unit tests in test_transformer failed:
combined_mask = tf.maximum(padding_mask, look_ahead_mask[:, tf.newaxis, :])
mult_attn_out1, attn_weights_block1 = self.mha1(x, x, x, combined_mask, return_attention_scores=True)
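For context, here is roughly how I am picturing the two masks before combining them. The helper functions below are my own sketch, following the shapes and 0/1 convention of the TensorFlow Transformer tutorial, so they may not match exactly what the assignment's create_padding_mask / create_look_ahead_mask return:

```python
import tensorflow as tf

# My own sketch, NOT necessarily the assignment's helpers.
# Convention here: 1 = position to be masked out, 0 = position to keep.
def create_padding_mask(seq):
    # (batch, 1, seq_len): 1 where the token id is 0 (padding)
    return tf.cast(tf.math.equal(seq, 0), tf.float32)[:, tf.newaxis, :]

def create_look_ahead_mask(size):
    # (seq_len, seq_len): 1 strictly above the diagonal (future positions)
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

tokens = tf.constant([[12, 45, 7, 0, 0]])        # one sequence, last two positions padded
padding_mask = create_padding_mask(tokens)       # (1, 1, 5)
look_ahead_mask = create_look_ahead_mask(5)      # (5, 5)

# Union of the two masks, as in the TensorFlow tutorial.
combined = tf.maximum(padding_mask, look_ahead_mask[tf.newaxis, :, :])  # (1, 5, 5)
print(combined)

# Note: tf.keras.layers.MultiHeadAttention documents attention_mask as shape
# (batch, target_len, source_len), broadcast over the heads, where 1 means
# "attend" and 0 means "don't attend" -- i.e. the opposite convention.
```

If my assumptions about the mask shapes or the 0/1 convention are off, that may well be the source of my confusion, so corrections are welcome.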