Why is an extra dim needed in create_padding_mask in the Transformer Network assignment?

In the programming exercise:

I was thinking the mask should have the same shape as the input “seq”. Why are 2 new dimensions needed? Also, as a result, the 2 softmax results computed next have different shapes: [3, 5] vs. [3, 1, 3, 5]??

The mask is used in scaled_dot_product_attention, where it is added to the attention logits (q multiplied by k transposed). As a result, its shape has to be broadcast-compatible with (batch_size, num_heads, seq_len, seq_len), which is why the extra size-1 dimensions are added.
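
Here is a minimal sketch of that broadcasting (the sizes and the sign convention for the mask are illustrative, not necessarily identical to the graded code):

```python
import tensorflow as tf

# Illustrative sizes, not the grader's exact values.
batch_size, num_heads, seq_len, depth = 3, 2, 5, 4

# Padding mask with two extra size-1 dims: (batch_size, 1, 1, seq_len).
# Here 1 marks a padding (zero) token; the assignment's convention may differ.
seq = tf.constant([[7, 6, 0, 0, 1],
                   [1, 2, 3, 0, 0],
                   [0, 0, 0, 4, 5]], dtype=tf.float32)
mask = tf.cast(tf.math.equal(seq, 0), tf.float32)[:, tf.newaxis, tf.newaxis, :]
print(mask.shape)  # (3, 1, 1, 5)

# Attention logits q @ k^T inside one multi-head attention call:
q = tf.random.normal((batch_size, num_heads, seq_len, depth))
k = tf.random.normal((batch_size, num_heads, seq_len, depth))
logits = tf.matmul(q, k, transpose_b=True)  # (3, 2, 5, 5)

# The size-1 dims broadcast over num_heads and over the query positions,
# so every head and every query row sees the same key-padding pattern.
weights = tf.nn.softmax(logits + mask * -1e9, axis=-1)  # (3, 2, 5, 5)
print(weights.shape)
```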
By the way, the softmax calculation example with and without the mask is only there to demonstrate how masking works. In MultiHeadAttention we never add the mask to the input sequence x directly; it is applied to the attention logits.
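
As for the two shapes you mention: the demo adds the (3, 1, 1, 5) mask to the (3, 5) input x purely for illustration, and broadcasting is what produces the (3, 1, 3, 5) output. A small sketch (again, the mask's sign convention here is an assumption, not necessarily the notebook's):

```python
import tensorflow as tf

x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])
mask = tf.cast(tf.math.equal(x, 0), tf.float32)[:, tf.newaxis, tf.newaxis, :]  # (3, 1, 1, 5)

# Without the mask, softmax keeps x's shape; with it, broadcasting
# (3, 5) against (3, 1, 1, 5) yields (3, 1, 3, 5).
print(tf.keras.activations.softmax(x).shape)                # (3, 5)
print(tf.keras.activations.softmax(x + mask * -1e9).shape)  # (3, 1, 3, 5)
```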