I’m doing the Week 4 programming exercise in the Sequence Models course. I can’t understand the use of the padding mask: what effect do the masked positions have on the softmax function, and why do we add one additional dimension to the mask at the end?
Thank you for your help!
For an answer to your first question, you can have a look here.
As I understand it, the additional dimension serves to let the padding mask broadcast against the attention logits so it can be added to them: the mask starts out with shape (batch, seq_len), while the logits have extra axes for the heads and query positions.
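To make both points concrete, here is a minimal NumPy sketch (not the course's exact code) assuming the common convention that the mask is 1 at padding positions and is multiplied by -1e9 before being added to the logits; the course assignment may use the opposite convention (1 for real tokens) with `(1 - mask) * -1e9`:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Batch of token ids; 0 is the padding token (assumption for this sketch)
seq = np.array([[7, 6, 0, 0],
                [1, 2, 3, 0]])

# 1 where the token is padding, 0 where it is a real token
mask = (seq == 0).astype(np.float32)          # shape (batch, seq_len)

# Extra dimensions so the mask broadcasts over heads and query positions:
# (batch, seq_len) -> (batch, 1, 1, seq_len)
mask = mask[:, np.newaxis, np.newaxis, :]

# Toy attention logits with shape (batch, heads=1, q_len=4, k_len=4)
scores = np.ones((2, 1, 4, 4), dtype=np.float32)

# Adding -1e9 at masked positions drives their softmax weight to ~0
weights = softmax(scores + mask * -1e9)

print(weights[0, 0, 0])  # ~[0.5, 0.5, 0.0, 0.0]: padded keys get no attention
```

So the masked logits become hugely negative, and since softmax exponentiates them, those positions end up with attention weight essentially 0, while the remaining weights still sum to 1 over the real tokens.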