For future learners: the OP's mistake was in how the dec_padding_mask was defined.
Note that when creating the padding mask for the decoder's second attention block (the encoder-decoder attention), we build it from the encoder_input. In other words, we tell the decoder not to pay attention to the padding tokens of the document being summarized.
Also note that this is different from the look_ahead_mask (causal mask), which restricts each decoder position to attending only to itself and earlier positions. A sketch of how the masks fit together is below.
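Here is a minimal sketch of the mask construction, written in the style of the TensorFlow Transformer tutorial. It assumes padding token id 0, and the helper names (create_padding_mask, create_look_ahead_mask, create_masks) follow the tutorial's convention rather than anything specific to the OP's code:

```python
import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 where seq is the padding token (assumed id 0), 0.0 elsewhere.
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # Broadcastable to (batch, num_heads, seq_len_q, seq_len_k).
    return seq[:, tf.newaxis, tf.newaxis, :]

def create_look_ahead_mask(size):
    # Upper-triangular mask: position i may attend to positions 0..i only.
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

def create_masks(inp, tar):
    # Encoder self-attention: mask padding in the encoder input.
    enc_padding_mask = create_padding_mask(inp)

    # Decoder's SECOND attention block attends over encoder output,
    # so its padding mask is also built from the encoder input (inp),
    # NOT from the decoder target -- this was the OP's mistake.
    dec_padding_mask = create_padding_mask(inp)

    # Decoder self-attention: combine the causal mask with the
    # padding mask of the decoder target.
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)
    combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return enc_padding_mask, combined_mask, dec_padding_mask
```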
Cheers