I do not understand this line in the train_step function of the C4W2_Assignment. Why are the encoder and decoder padding masks equal? In other words, why
enc_padding_mask= dec_padding_mask = create_padding_mask(encoder_input)
It’s a good question.
In simple words: the decoder's cross-attention attends over the encoder output, so it needs a mask telling it not to pay attention to the padding tokens of the document being summarized. Those padding positions are determined by encoder_input, so the same mask serves both the encoder's self-attention (enc_padding_mask) and the decoder's cross-attention (dec_padding_mask).
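To make that concrete, here is a minimal sketch of a padding mask, assuming the padding id is 0 and following the TensorFlow Transformer tutorial's convention that 1.0 marks a position to be masked out. The assignment's actual helper may differ in details, and the encoder_input here is just a toy batch:

```python
import tensorflow as tf

def create_padding_mask(seq, pad_id=0):
    # 1.0 marks padding positions, 0.0 marks real tokens.
    # Shape (batch, 1, 1, seq_len) broadcasts over heads and query positions.
    mask = tf.cast(tf.math.equal(seq, pad_id), tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]

# Hypothetical batch: two real tokens followed by two pads.
encoder_input = tf.constant([[5, 7, 0, 0]])

# Both masks come from encoder_input because both attention blocks that
# use them attend over the *encoder* sequence:
#   - enc_padding_mask: encoder self-attention
#   - dec_padding_mask: decoder cross-attention (keys/values = encoder output)
enc_padding_mask = dec_padding_mask = create_padding_mask(encoder_input)
print(enc_padding_mask)  # [[[[0. 0. 1. 1.]]]]
```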
Also note that this is different from the look_ahead_mask
(causal mask), which applies to the decoder's self-attention: there, each position is only allowed to pay attention to itself and the previous tokens.
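For contrast, a minimal sketch of the causal mask under the same 1.0 = masked convention:

```python
def create_look_ahead_mask(size):
    # Strictly upper-triangular mask: 1.0 above the diagonal, so position i
    # can only attend to positions <= i in the decoder's self-attention.
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(create_look_ahead_mask(3))
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]]
```

Its size depends on the decoder sequence length, not on encoder_input, which is another way to see why it is a separate mask from the two padding masks.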
Check my previous explanation with an example; it might add clarity.
Cheers
Understood, thank you