Padding Mask newaxis confusion

I am not able to understand why we add newaxis to the padding mask, as below, to change its shape from [B, Seq_len] to [B, 1, 1, Seq_len] (where Seq_len is the number of words in the sequence).

seq[:, tf.newaxis, tf.newaxis, :]

Here is our input:
x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])
The mask is:

[[[[1. 1. 0. 0. 1.]]]

 [[[1. 1. 1. 0. 0.]]]

 [[[0. 0. 0. 1. 1.]]]]

Here we want very large negative values in place of the zeros in x, so that those positions have no effect once softmax is applied. This is what I expected:

[[ 7.e+00,  6.e+00, -1.e+09, -1.e+09,  1.e+00],
       [ 1.e+00,  2.e+00,  3.e+00, -1.e+09, -1.e+09],
       [-1.e+09, -1.e+09, -1.e+09,  4.e+00,  5.e+00]]

Instead we get

[[[[ 7.e+00,  6.e+00, -1.e+09, -1.e+09,  1.e+00],
         [ 1.e+00,  2.e+00, -1.e+09, -1.e+09,  0.e+00],
         [ 0.e+00,  0.e+00, -1.e+09, -1.e+09,  5.e+00]]],

       [[[ 7.e+00,  6.e+00,  0.e+00, -1.e+09, -1.e+09],
         [ 1.e+00,  2.e+00,  3.e+00, -1.e+09, -1.e+09],
         [ 0.e+00,  0.e+00,  0.e+00, -1.e+09, -1.e+09]]],

       [[[-1.e+09, -1.e+09, -1.e+09,  0.e+00,  1.e+00],
         [-1.e+09, -1.e+09, -1.e+09,  0.e+00,  0.e+00],
         [-1.e+09, -1.e+09, -1.e+09,  4.e+00,  5.e+00]]]]

This is before applying softmax

The scaled_dot_product_attention function is part of MultiHeadAttention and expects the input arguments q, k, v to have shape (batch_size, number_of_heads, seq_len, depth). Since you create the mask from an input of shape (3, 5), that implies batch_size=3 and seq_len=5. The logits before masking are the product of q and k transpose, so the shape of the logits becomes (3, 3). A mask of shape (3, 1, 1, 5) can't be applied to that.
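The shape mismatch described here can be checked with a small sketch. It uses NumPy as a stand-in for TensorFlow (an assumption made for brevity; both follow the same broadcasting rules), it treats x itself as both q and k, and it infers the mask convention (1 for real tokens, 0 for padding) from the values shown in the question.

```python
import numpy as np

x = np.array([[7., 6., 0., 0., 1.],
              [1., 2., 3., 0., 0.],
              [0., 0., 0., 4., 5.]])

# Padding mask as shown in the question: 1 for real tokens, 0 for padding,
# reshaped from (3, 5) to (3, 1, 1, 5).
mask = (x != 0).astype(np.float64)[:, np.newaxis, np.newaxis, :]

# Treating x as both q and k, the logits q @ k^T have shape (3, 3).
logits = x @ x.T

try:
    logits + (1.0 - mask) * -1e9
    broadcast_ok = True
except ValueError:
    # The trailing dimensions (3 and 5) are incompatible, so NumPy refuses.
    broadcast_ok = False

print(logits.shape, mask.shape, broadcast_ok)
```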

Thank you edwardyu for the explanation!!
But here due to this another doubt arises!!
My understanding is

(batch_size=Total number of documents or sentences present in training, 
number_of_heads=Total number of Heads we will be generating using multihead attention,
seq_len= total number of words + padding in a sentence , 
depth= dimension of individual word vectors)

We feed the transformer inputs of a fixed sequence length, i.e. if a sentence is shorter than that length, it is padded with zeros. So here the broadcasting should be done on the basis of the batch as well as seq_len, but it is done on the basis of depth.

Now either my understanding of the variables is incorrect or the concept is altogether wrong!! Please help me understand if it is the latter case!!

Sorry! It’s my fault. I fixed it.
In fact, you should not be able to apply the mask to it. I guess something went wrong in between. The values you showed (the logits before applying softmax) look like x and the mask term simply added together. When you add two arrays with shapes (3, 5) and (3, 1, 1, 5), the result has shape (3, 1, 3, 5). Here are some broadcasting rules for your reference.
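As a rough check of that guess, the sketch below adds x directly to the mask term. It again uses NumPy in place of TensorFlow, and the (1 - mask) * -1e9 form is an assumption inferred from the numbers in the question, not code taken from the assignment.

```python
import numpy as np

x = np.array([[7., 6., 0., 0., 1.],
              [1., 2., 3., 0., 0.],
              [0., 0., 0., 4., 5.]])

# Shape (3, 1, 1, 5): one row of 1s (tokens) and 0s (padding) per sentence.
mask = (x != 0).astype(np.float64)[:, np.newaxis, np.newaxis, :]

# x is treated as (1, 1, 3, 5), so every sentence's mask row is applied
# to ALL three rows of x, not only to its own row.
out = x + (1.0 - mask) * -1e9

print(out.shape)
print(out[0, 0])  # whole of x, with the columns padded in sentence 0 pushed to about -1e9
```

Each out[b, 0] is the full x with the padded columns of sentence b overwritten, which is the pattern in the unexpected output above.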

Hi Edwardyu!!
Thanks again for replying!!
After reading your last comment, it seems there is a problem with the assignment, because the create_padding_mask function is not able to mask x!! Is that so?

Yes, you can, but you cannot pass the pair of x and create_padding_mask(x) as arguments to scaled_dot_product_attention. If you look at Encoder (UNQ_C5) carefully, you’ll see there is an embedding layer before MultiHeadAttention. The embedding layer transforms x from shape (batch_size, seq_len) into shape (batch_size, seq_len, depth). The MultiHeadAttention layer further splits it into shape (batch_size, head_numbers, seq_len, depth) before applying scaled_dot_product_attention.
As you can see in the scaled_dot_product_attention function, q is multiplied by k transpose, so the shape becomes (batch_size, head_numbers, seq_len, seq_len), and then the mask of shape (batch_size, 1, 1, seq_len) is applied.
So, the pair of x and create_padding_mask(x) is the input to Encoder, rather than to scaled_dot_product_attention.
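To make the shapes concrete, here is a minimal sketch of that flow. It is not the assignment's code: NumPy replaces TensorFlow, random arrays stand in for the output of the embedding and head-splitting steps, and num_heads=2 and depth=4 are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([[7., 6., 0., 0., 1.],
              [1., 2., 3., 0., 0.],
              [0., 0., 0., 4., 5.]])
batch_size, seq_len = x.shape
num_heads, depth = 2, 4  # hypothetical sizes

# (batch_size, 1, 1, seq_len): 1 for real tokens, 0 for padding.
mask = (x != 0).astype(np.float64)[:, np.newaxis, np.newaxis, :]

# Random stand-ins for q and k after embedding and splitting into heads.
q = rng.normal(size=(batch_size, num_heads, seq_len, depth))
k = rng.normal(size=(batch_size, num_heads, seq_len, depth))

# q @ k^T gives (batch_size, num_heads, seq_len, seq_len).
logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(depth)

# The two size-1 axes of the mask broadcast over heads and query positions,
# so each sentence's padded KEY positions are pushed to about -1e9.
logits = logits + (1.0 - mask) * -1e9

# Softmax over the last (key) axis.
logits = logits - logits.max(axis=-1, keepdims=True)
weights = np.exp(logits)
weights = weights / weights.sum(axis=-1, keepdims=True)

print(weights.shape)
print(weights[0, 0].round(3))  # columns 2 and 3 (padding in sentence 0) are 0
```

This is why the mask is given two extra axes: with shape (batch_size, 1, 1, seq_len) it lines up with the batch and key axes of the logits and is repeated across every head and every query position.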