Padding Mask newaxis confusion

Varun_Rathi · July 5, 2021, 1:23pm

I don't understand why we add newaxis to the padding mask, as below, to change its shape from [B, Seq_len] to [B, 1, 1, Seq_len] (where Seq_len is the number of words in the sequence):

seq[:, tf.newaxis, tf.newaxis, :]

Here is my input:
x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])
The mask is:

[[[[1. 1. 0. 0. 1.]]]


 [[[1. 1. 1. 0. 0.]]]


 [[[0. 0. 0. 1. 1.]]]]

Here we want very large negative values in place of the zeros in x, so that those positions have no effect once softmax is applied. I expected:

[[ 7.e+00,  6.e+00, -1.e+09, -1.e+09,  1.e+00],
       [ 1.e+00,  2.e+00,  3.e+00, -1.e+09, -1.e+09],
       [-1.e+09, -1.e+09, -1.e+09,  4.e+00,  5.e+00]]

Instead we get

[[[[ 7.e+00,  6.e+00, -1.e+09, -1.e+09,  1.e+00],
         [ 1.e+00,  2.e+00, -1.e+09, -1.e+09,  0.e+00],
         [ 0.e+00,  0.e+00, -1.e+09, -1.e+09,  5.e+00]]],


       [[[ 7.e+00,  6.e+00,  0.e+00, -1.e+09, -1.e+09],
         [ 1.e+00,  2.e+00,  3.e+00, -1.e+09, -1.e+09],
         [ 0.e+00,  0.e+00,  0.e+00, -1.e+09, -1.e+09]]],


       [[[-1.e+09, -1.e+09, -1.e+09,  0.e+00,  1.e+00],
         [-1.e+09, -1.e+09, -1.e+09,  0.e+00,  0.e+00],
         [-1.e+09, -1.e+09, -1.e+09,  4.e+00,  5.e+00]]]]

This is before applying softmax
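
The numbers above can be reproduced with a minimal sketch. It uses NumPy rather than TensorFlow so it runs without the course environment (np.newaxis and NumPy broadcasting behave like their tf counterparts). The formula `x + (1 - mask) * -1e9` and the convention that 1 marks a real token are assumptions inferred from the printed mask and outputs, not taken from the assignment code.

```python
import numpy as np

# Token ids from the post; 0 is treated as padding.
x = np.array([[7., 6., 0., 0., 1.],
              [1., 2., 3., 0., 0.],
              [0., 0., 0., 4., 5.]])

# Assumed convention (matches the printed mask): 1 = real token, 0 = padding.
mask_2d = (x != 0).astype(np.float64)              # shape (3, 5)
mask_4d = mask_2d[:, np.newaxis, np.newaxis, :]    # shape (3, 1, 1, 5)

# Same-shape mask: the result keeps shape (3, 5), which is the expected output.
expected = x + (1.0 - mask_2d) * -1e9

# 4-D mask: x is broadcast against every batch entry of the mask,
# so the result has shape (3, 1, 3, 5), which is the output actually observed.
actual = x + (1.0 - mask_4d) * -1e9

# The expected rows are still inside the broadcast result, at actual[b, 0, b].
recovered = np.stack([actual[b, 0, b] for b in range(3)])

print(expected.shape, actual.shape)
print(np.array_equal(recovered, expected))
```

So the extra rows come from every row of x being paired with every row of the mask, not from the mask values themselves being wrong.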

edwardyu · July 6, 2021, 2:48am

The scaled_dot_product_attention function is part of MultiHeadAttention and expects the input arguments q, k, v to have shape (batch_size, number_of_heads, seq_len, depth). Since you created the mask from an input of shape (3, 5), that implies batch_size=3 and seq_len=5. The logits before masking are the product of q and k transposed, so if x of shape (3, 5) is passed in as q and k, the logits have shape (3, 3). A mask of shape (3, 1, 1, 5) cannot be applied to that.
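
A small sketch of the shape argument above, again in NumPy as a stand-in for TensorFlow. Using x itself as both q and k is purely for illustration, and the masking formula is the same assumed one as elsewhere in this thread.

```python
import numpy as np

x = np.array([[7., 6., 0., 0., 1.],
              [1., 2., 3., 0., 0.],
              [0., 0., 0., 4., 5.]])

# Padding mask with the two extra axes, shape (3, 1, 1, 5).
mask = (x != 0).astype(np.float64)[:, np.newaxis, np.newaxis, :]

# Treat the 2-D x as both q and k (illustration only): q @ k^T has shape (3, 3).
logits = x @ x.T

# The trailing dimensions are 3 and 5, so the shapes are not broadcastable.
try:
    logits + (1.0 - mask) * -1e9
    masking_failed = False
except ValueError:
    masking_failed = True

print(logits.shape, masking_failed)
```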

Varun_Rathi · July 6, 2021, 5:41am

Thank you, edwardyu, for the explanation!
But this raises another doubt.
My understanding is:

(batch_size=Total number of documents or sentences present in training, 
number_of_heads=Total number of Heads we will be generating using multihead attention,
seq_len= total number of words + padding in a sentence , 
depth= dimension of individual word vectors)

We feed the transformer inputs of a fixed sequence length, i.e. if a sentence is shorter than that length, it is padded with zero vectors. So here the broadcasting should be done on the basis of the batch and Sequence_len, but it is done on the basis of depth.

Now either my understanding of the variables is incorrect or the concept is altogether wrong. Please help me understand if it is the latter case.

edwardyu · July 6, 2021, 6:37am

Sorry! It's my fault. I fixed it.
In fact, you should not be able to apply the mask to it. I guess something went wrong in between. The values you showed (the logits before applying softmax) look like the result of simply adding x and the mask term. When you add two arrays with shapes (3, 5) and (3, 1, 1, 5), the result has shape (3, 1, 3, 5). Here are some broadcasting rules for your reference.
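
As a rough illustration of the broadcasting rules mentioned above: shapes are aligned from the last dimension, a missing leading dimension counts as 1, and a dimension of size 1 stretches to match the other. The helper `broadcast_shape` below is a hypothetical pure-Python function written for this thread, not a TensorFlow or NumPy API.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the shape produced by broadcasting shapes a and b, or raise ValueError."""
    result = []
    # Walk both shapes from the trailing dimension; pad the shorter one with 1s.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"cannot broadcast {a} with {b}")
    return tuple(reversed(result))


# x of shape (3, 5) against the padding mask of shape (3, 1, 1, 5).
ok_shape = broadcast_shape((3, 5), (3, 1, 1, 5))
print(ok_shape)

# Logits of shape (3, 3) against the same mask: trailing 3 vs 5 is incompatible.
try:
    broadcast_shape((3, 3), (3, 1, 1, 5))
    incompatible = False
except ValueError:
    incompatible = True
print(incompatible)
```

Reading (3, 5) as (1, 1, 3, 5) makes it clear why the batch axis of the mask and the row axis of x end up as separate dimensions in the result.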

Varun_Rathi · July 6, 2021, 7:15am

Hi edwardyu!
Thanks again for replying!
After reading your last comment, it seems there is a problem with the assignment, because the create_padding_mask function is not able to mask x. Is that so?

edwardyu · July 6, 2021, 8:29am

Yes, you can, but you cannot pass the pair x and create_padding_mask(x) as arguments to scaled_dot_product_attention. If you look carefully at the Encoder (UNQ_C5), you'll see there is an embedding layer before MultiHeadAttention. The embedding layer transforms x from shape (batch_size, seq_len) into shape (batch_size, seq_len, depth). The MultiHeadAttention layer then splits it into shape (batch_size, head_numbers, seq_len, depth) before applying scaled_dot_product_attention.
As you can see in the scaled_dot_product_attention function, q is multiplied by k transposed, giving shape (batch_size, head_numbers, seq_len, seq_len), and then the mask of shape (batch_size, 1, 1, seq_len) is applied. The two size-1 axes let the same mask broadcast over every head and every query position.
So the pair x and create_padding_mask(x) are the inputs of the Encoder, rather than of scaled_dot_product_attention.
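
The shape flow described above can be sketched as follows, in NumPy as a stand-in for TensorFlow. The sizes, the random q and k, the `1 = real token` mask convention, and the `(1 - mask) * -1e9` masking term are assumptions for illustration and may differ from the assignment's actual code.

```python
import numpy as np

# Hypothetical sizes chosen for illustration.
batch_size, num_heads, seq_len, depth = 3, 2, 5, 4

# Stand-ins for q and k after the embedding layer and the head split.
rng = np.random.default_rng(0)
q = rng.normal(size=(batch_size, num_heads, seq_len, depth))
k = rng.normal(size=(batch_size, num_heads, seq_len, depth))

# Token ids from the thread; 0 is treated as padding.
token_ids = np.array([[7, 6, 0, 0, 1],
                      [1, 2, 3, 0, 0],
                      [0, 0, 0, 4, 5]])
mask = (token_ids != 0).astype(np.float64)[:, np.newaxis, np.newaxis, :]  # (3, 1, 1, 5)

# q @ k^T over the last two axes, scaled: shape (3, 2, 5, 5).
logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(depth)

# The mask broadcasts over heads (axis 1) and query positions (axis 2),
# so the shape stays (3, 2, 5, 5) and only padded key columns are pushed down.
masked = logits + (1.0 - mask) * -1e9

# Softmax over the key axis; padded keys end up with zero weight.
shifted = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(masked.shape)
print(weights[0, 0].round(3))
```

With logits of shape (batch_size, head_numbers, seq_len, seq_len), the (batch_size, 1, 1, seq_len) mask no longer adds any new dimensions, which is the difference from adding it to the raw (3, 5) input.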
