C5W4: Padding mask in the Transformer

Hello everyone.
I’m doing the programming exercise of week 4 in the Sequence Models course. I cannot understand the use of the padding mask: what is the effect of 0 in a softmax function, and why do we add one additional dimension to the mask at the end?
Thank you for your help!

Hi Huynh_Tan_Khiem,

For an answer to your first question, you can have a look here.
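In short: the masked positions don’t end up as 0 *inside* the softmax; a very large negative number (e.g. -1e9) is added to their logits, so their softmax weight comes out as approximately 0. Here is a minimal numeric sketch of that effect (not the assignment code; I’m assuming the common convention where 1.0 marks a padded position):

```python
import tensorflow as tf

# Attention logits for 4 key positions; suppose the last one is padding.
logits = tf.constant([2.0, 1.0, 0.5, 0.3])

# 1.0 marks the padded position (an assumed convention, not necessarily
# the one in your notebook). Adding -1e9 there drives its softmax
# weight to (almost) exactly 0.
mask = tf.constant([0.0, 0.0, 0.0, 1.0])
masked_logits = logits + mask * -1e9

print(tf.nn.softmax(logits).numpy())         # padding still gets weight
print(tf.nn.softmax(masked_logits).numpy())  # padding weight ~ 0.0
```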

As I understand it, the additional dimension is there for broadcasting: the raw mask has shape (batch_size, seq_len), but the attention logits it must be added to have shape (batch_size, num_heads, seq_len_q, seq_len_k). Inserting extra axes of size 1 lets the same mask broadcast across every head and every query position. There is a small sketch of this below.
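Here is a minimal, self-contained sketch of that broadcasting. I’m assuming the convention from the TensorFlow transformer tutorial (token id 0 is padding, the mask is 1.0 at padded positions, and it is multiplied by -1e9); your notebook may use the inverted convention, i.e. `(1 - mask) * -1e9`:

```python
import tensorflow as tf

batch, heads, seq_len = 2, 4, 5

# Token ids; 0 is assumed to be the padding id.
tokens = tf.constant([[7, 6, 0, 0, 0],
                      [1, 2, 3, 0, 0]])

# Padding mask: 1.0 where the token is padding.
mask = tf.cast(tf.math.equal(tokens, 0), tf.float32)  # (batch, seq_len)

# Extra axes so the mask broadcasts against the attention logits,
# which have shape (batch, heads, seq_len_q, seq_len_k).
mask = mask[:, tf.newaxis, tf.newaxis, :]             # (batch, 1, 1, seq_len)

logits = tf.random.normal((batch, heads, seq_len, seq_len))
masked = logits + mask * -1e9      # broadcasts over heads and query positions
weights = tf.nn.softmax(masked, axis=-1)

# Every query in every head now assigns ~0 attention to the padded keys.
print(weights[0, 0].numpy().round(3))
```

Without the extra axes, adding a (batch, seq_len) mask to a 4-D logits tensor would either fail or broadcast along the wrong axis, so the size-1 dimensions are what make the single per-sequence mask apply uniformly to all heads and queries.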