It’s a good question.
In simple words: we tell the decoder not to pay attention to the padding tokens of the document being summarized.
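Here's a minimal NumPy sketch of what such a padding mask looks like (my own illustration, not code from the original post; it assumes token ID 0 is padding and the `(batch, 1, 1, seq_len)` mask shape used in the usual TensorFlow Transformer tutorial):

```python
import numpy as np

def create_padding_mask(seq):
    # 1.0 marks positions attention must NOT look at (padding tokens),
    # assuming token ID 0 is the padding token.
    mask = (seq == 0).astype(np.float32)
    # Add broadcast dims so the same mask applies to every attention head
    # and every query position: (batch, 1, 1, key_len).
    return mask[:, np.newaxis, np.newaxis, :]

seq = np.array([[7, 6, 0, 0, 1],
                [1, 2, 3, 0, 0]])
print(create_padding_mask(seq))
# [[[[0. 0. 1. 1. 0.]]]   attention to the two padding positions is blocked
#  [[[0. 0. 0. 1. 1.]]]]
```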
Also note that this is different from the look_ahead_mask (causal mask), which allows the decoder to attend only to the current token and the tokens before it.
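For contrast, here's a sketch of the look_ahead_mask in the same style (again my own illustration, with 1.0 marking the future positions each token is forbidden to attend to):

```python
import numpy as np

def create_look_ahead_mask(size):
    # Upper triangle strictly above the diagonal = future tokens,
    # which each query position is not allowed to see.
    return 1.0 - np.tril(np.ones((size, size), dtype=np.float32))

print(create_look_ahead_mask(4))
# [[0. 1. 1. 1.]   token 0 may attend only to itself
#  [0. 0. 1. 1.]   token 1 may attend to tokens 0-1
#  [0. 0. 0. 1.]   ...
#  [0. 0. 0. 0.]]  the last token may attend to everything before it
```

In a typical implementation, either mask is multiplied by a large negative number and added to the scaled attention scores before the softmax, which drives the masked positions' attention weights to (near) zero.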
Check my previous explanation with an example; it might add clarity.
Cheers