Padding mask dimensions

Why does the padding mask have one additional dimension?
From the description in C5_W4_A1_Transformer_Subclass_v1 Section 2.1, it’s not obvious to me that the result needs to have an extra dimension. Can you explain the purpose, and how the values of this extra dimension are calculated?

I can see the code but I do not understand what it’s doing:
seq = 1 - tf.cast(tf.math.equal(decoder_token_ids, 0), tf.float32)
seq[:, tf.newaxis, :]
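
For context, in the notebook these lines form the body of create_padding_mask, roughly like this (a sketch with comments added; the notebook’s version may differ slightly):

import tensorflow as tf

def create_padding_mask(decoder_token_ids):
    # 1.0 where the token is real, 0.0 where it is the padding token (id 0)
    seq = 1 - tf.cast(tf.math.equal(decoder_token_ids, 0), tf.float32)
    # insert a size-1 axis so the mask has shape (batch, 1, seq_len);
    # this axis later broadcasts across the query rows of the attention logits
    return seq[:, tf.newaxis, :]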

Thanks,

Thanks for bringing this up. The staff have been notified about this.

I’m not sure this was the intended reply for this topic, as I was asking for help in understanding what the code is doing and why.

To expand on my question, can you help me understand this example?
x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])
print((1 - create_padding_mask(x)) * -1.0e9)
print(x + (1 - create_padding_mask(x)) * -1.0e9)

tf.Tensor(
[[[-0.e+00 -0.e+00 -1.e+09 -1.e+09 -0.e+00]]

[[-0.e+00 -0.e+00 -0.e+00 -1.e+09 -1.e+09]]

[[-1.e+09 -1.e+09 -1.e+09 -0.e+00 -0.e+00]]], shape=(3, 1, 5), dtype=float32)

tf.Tensor(
[[[ 7.e+00 6.e+00 -1.e+09 -1.e+09 1.e+00]
[ 1.e+00 2.e+00 -1.e+09 -1.e+09 0.e+00]
[ 0.e+00 0.e+00 -1.e+09 -1.e+09 5.e+00]]

[[ 7.e+00 6.e+00 0.e+00 -1.e+09 -1.e+09]
[ 1.e+00 2.e+00 3.e+00 -1.e+09 -1.e+09]
[ 0.e+00 0.e+00 0.e+00 -1.e+09 -1.e+09]]

[[-1.e+09 -1.e+09 -1.e+09 0.e+00 1.e+00]
[-1.e+09 -1.e+09 -1.e+09 0.e+00 0.e+00]
[-1.e+09 -1.e+09 -1.e+09 4.e+00 5.e+00]]], shape=(3, 3, 5), dtype=float32)

Why is the 3x5 tensor being transformed into a 3x3x5 tensor?
I can see the original x tensor with -1.e+09 filled in where there was a zero before, but what are all these other values?
It looks like some kind of casting happens when adding the 3x5 tensor to the 3x1x5 mask, but:

  1. I do not really understand what is going on; help would be appreciated here (see the sketch after this list).
  2. I do not understand the purpose and the effect of having the mask output shaped this way.
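
For reference, a minimal self-contained sketch of the addition in question (the middle size-1 axis is what lets one mask row apply to every query row):

import tensorflow as tf

x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])

# same computation as create_padding_mask
mask = 1 - tf.cast(tf.math.equal(x, 0), tf.float32)
mask = mask[:, tf.newaxis, :]                  # shape (3, 1, 5)
penalty = (1 - mask) * -1.0e9                  # -1e9 at padded positions, 0 elsewhere

# x has shape (3, 5); broadcasting treats it as (1, 3, 5) and stretches
# every size-1 axis, so the sum has shape (3, 3, 5)
print((x + penalty).shape)                     # (3, 3, 5)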

Sorry for not being clear.

What I meant was that the staff have been notified to update the documentation and the function (if required).

For starters,

x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])

doesn’t make sense to me, since 0 is usually used as the padding token. This means that 0s should only appear at the trailing end of each inner array.

The following example is more in line with the transformer architecture:

If integer encoded and padded batch looks like this for 2 sentences:

tf.constant([[12, 13, 0, 0, 0],
             [11, 12, 1, 0, 0]])

the padding mask should be

<tf.Tensor: shape=(2, 5, 5), dtype=int32, numpy=
array([[[1, 1, 0, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]], dtype=int32)>
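
In case it helps, one way to produce that (2, 5, 5) mask is an outer product of the 1-D keep-mask with itself (a sketch, not the current notebook code):

import tensorflow as tf

ids = tf.constant([[12, 13, 0, 0, 0],
                   [11, 12, 1, 0, 0]])

keep = tf.cast(tf.math.not_equal(ids, 0), tf.int32)     # (2, 5): 1 = real token
# position (i, j) survives only if both token i and token j are real
mask = keep[:, :, tf.newaxis] * keep[:, tf.newaxis, :]  # (2, 5, 5)
print(mask)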

Please implement scaled_dot_product_attention and you’ll see how the mask is used.
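
In outline, the masked part of that function looks roughly like this (a sketch modeled on the TensorFlow tutorial version; 1 means keep, 0 means mask out):

import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    matmul_qk = tf.matmul(q, k, transpose_b=True)           # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        # padded positions get a large negative logit,
        # so softmax gives them ~0 attention weight
        scaled_attention_logits += (1. - mask) * -1.0e9
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    return tf.matmul(attention_weights, v), attention_weights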

I can see that in scaled_dot_product_attention the mask is:

  1. transformed to replace ones with zeros and zeros with -1.0e9, using (1. - mask) * -1.0e9
  2. added to scaled_attention_logits with shape seq_len_q x seq_len_k

Also, when I run your example through ‘create_padding_mask’ I get:
tf.Tensor(
[[[1. 1. 0. 0. 0.]]

[[1. 1. 1. 0. 0.]]], shape=(2, 1, 5), dtype=float32)
Instead of the 2x5x5 you show above.

And when I run it through ‘x + (1 - create_padding_mask(x)) * -1.0e9’ I get:
tf.Tensor(
[[[ 1.2e+01 1.3e+01 -1.0e+09 -1.0e+09 -1.0e+09]
[ 1.1e+01 1.2e+01 -1.0e+09 -1.0e+09 -1.0e+09]]

[[ 1.2e+01 1.3e+01 0.0e+00 -1.0e+09 -1.0e+09]
[ 1.1e+01 1.2e+01 1.0e+00 -1.0e+09 -1.0e+09]]], shape=(2, 2, 5), dtype=float32)

But I still don’t get why we are adding an additional dimension, or how the addition of two tensors of shapes (2, 1, 5) and (2, 5) works.

The example I provided shows the suggested behavior, not the current notebook behavior.

As far as adding tensors of dissimilar shapes is concerned, please see this link on broadcasting.
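
For example, a quick sketch of the rule (shapes align from the right; missing or size-1 axes are stretched):

import tensorflow as tf

a = tf.zeros((2, 5))     # like the token batch
b = tf.zeros((2, 1, 5))  # like the padding mask

# align from the right: (2, 5) is treated as (1, 2, 5);
# (1, 2, 5) + (2, 1, 5) stretches each size-1 axis, giving (2, 2, 5)
print((a + b).shape)     # (2, 2, 5)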