You probably made the same mistake.
Note that when creating the padding mask for the decoder’s second attention block, we use the encoder_input. In other words, we tell the decoder not to attend to the padding tokens of the document being summarized.
Also note that this is different from the look_ahead_mask (causal mask), which only allows each decoder position to attend to itself and the preceding target tokens.
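To make the distinction concrete, here is a minimal NumPy sketch. It assumes the common convention that a 1 in the mask marks a position to be ignored and that the padding id is 0; the names encoder_input, decoder_input, create_padding_mask, and create_look_ahead_mask are illustrative, not necessarily the ones in your code:

```python
import numpy as np

def create_padding_mask(seq, pad_id=0):
    # 1 where the token is padding, 0 elsewhere; shape (batch, 1, 1, seq_len)
    # so it broadcasts over attention logits of shape (batch, heads, q_len, k_len).
    mask = (seq == pad_id).astype(np.float32)
    return mask[:, np.newaxis, np.newaxis, :]

def create_look_ahead_mask(size):
    # 1s above the diagonal: position i may only attend to positions <= i,
    # i.e. itself and the tokens before it.
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

# Toy batch: encoder_input is the padded source document,
# decoder_input is the padded target summary.
encoder_input = np.array([[5, 8, 3, 0, 0]])   # source ids, 0 = <pad>
decoder_input = np.array([[7, 2, 0]])         # target ids, 0 = <pad>

# Mask for the decoder's second (cross-)attention block:
# built from encoder_input, because the keys/values come from the encoder output.
cross_attention_padding_mask = create_padding_mask(encoder_input)

# Mask for the decoder's first (self-)attention block:
# the causal mask combined with the decoder's own padding mask.
look_ahead_mask = create_look_ahead_mask(decoder_input.shape[1])
decoder_self_attention_mask = np.maximum(
    create_padding_mask(decoder_input), look_ahead_mask)
```

The key point is that the two masks are built from different inputs: the cross-attention padding mask comes from the source sequence, while the look-ahead mask depends only on the target length.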
Cheers