Padding mask dimensions

Why does the padding mask have one additional dimension?
From the description in C5_W4_A1_Transformer_Subclass_v1 Section 2.1, it’s not obvious to me that the result needs to have an extra dimension. Can you explain the purpose, and how the values of this extra dimension are calculated?

I can see the code but I do not understand what it’s doing:
seq = 1 - tf.cast(tf.math.equal(decoder_token_ids, 0), tf.float32)
seq[:, tf.newaxis, :]
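
For context, in the notebook these lines form the body of create_padding_mask, roughly like this (a sketch with comments added; the notebook’s version may differ slightly):

import tensorflow as tf

def create_padding_mask(decoder_token_ids):
    # 1.0 where the token is real, 0.0 where it is the padding token (id 0)
    seq = 1 - tf.cast(tf.math.equal(decoder_token_ids, 0), tf.float32)
    # insert a size-1 axis so the mask has shape (batch, 1, seq_len);
    # this axis later broadcasts across the query rows of the attention logits
    return seq[:, tf.newaxis, :]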

Thanks,

Thanks for bringing this up. The staff have been notified about this.

I’m not sure this was the intended reply for this topic, as I was asking for help in understanding what the code is doing and why.

To expand on my question, can you help me understand this example?
x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])
print((1 - create_padding_mask(x)) * -1.0e9)
print(x + (1 - create_padding_mask(x)) * -1.0e9)

tf.Tensor(
[[[-0.e+00 -0.e+00 -1.e+09 -1.e+09 -0.e+00]]

[[-0.e+00 -0.e+00 -0.e+00 -1.e+09 -1.e+09]]

[[-1.e+09 -1.e+09 -1.e+09 -0.e+00 -0.e+00]]], shape=(3, 1, 5), dtype=float32)

tf.Tensor(
[[[ 7.e+00 6.e+00 -1.e+09 -1.e+09 1.e+00]
[ 1.e+00 2.e+00 -1.e+09 -1.e+09 0.e+00]
[ 0.e+00 0.e+00 -1.e+09 -1.e+09 5.e+00]]

[[ 7.e+00 6.e+00 0.e+00 -1.e+09 -1.e+09]
[ 1.e+00 2.e+00 3.e+00 -1.e+09 -1.e+09]
[ 0.e+00 0.e+00 0.e+00 -1.e+09 -1.e+09]]

[[-1.e+09 -1.e+09 -1.e+09 0.e+00 1.e+00]
[-1.e+09 -1.e+09 -1.e+09 0.e+00 0.e+00]
[-1.e+09 -1.e+09 -1.e+09 4.e+00 5.e+00]]], shape=(3, 3, 5), dtype=float32)

Why is the 3x5 tensor being transformed into a 3x3x5 tensor?
I can see the original x tensor with -1.e+09 filled in where there was a zero before, but what are all these other values?
It looks like some kind of casting happens when adding the 3x5 tensor to the 3x1x5 mask, but:

  1. I do not really understand what is going on; help would be appreciated here (see the sketch after this list).
  2. I do not understand the purpose and the effect of having the mask output shaped this way.
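
For reference, a minimal self-contained sketch of the addition in question (the middle size-1 axis is what lets one mask row apply to every query row):

import tensorflow as tf

x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])

# same computation as create_padding_mask
mask = 1 - tf.cast(tf.math.equal(x, 0), tf.float32)
mask = mask[:, tf.newaxis, :]                  # shape (3, 1, 5)
penalty = (1 - mask) * -1.0e9                  # -1e9 at padded positions, 0 elsewhere

# x has shape (3, 5); broadcasting treats it as (1, 3, 5) and stretches
# every size-1 axis, so the sum has shape (3, 3, 5)
print((x + penalty).shape)                     # (3, 3, 5)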

Sorry for not being clear.

What I meant was that the staff have been notified to update the documentation and the function (if required).

For starters,

x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])

doesn’t make sense to me, since 0 is usually used as the padding token. This means that 0s should only appear at the trailing end of each inner array.

The following example is more in line with the transformer architecture:

If integer encoded and padded batch looks like this for 2 sentences:

tf.constant([[12, 13, 0, 0, 0],
             [11, 12, 1, 0, 0]])

the padding mask should be

<tf.Tensor: shape=(2, 5, 5), dtype=int32, numpy=
array([[[1, 1, 0, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]], dtype=int32)>
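
In case it helps, one way to produce that (2, 5, 5) mask is an outer product of the 1-D keep-mask with itself (a sketch, not the current notebook code):

import tensorflow as tf

ids = tf.constant([[12, 13, 0, 0, 0],
                   [11, 12, 1, 0, 0]])

keep = tf.cast(tf.math.not_equal(ids, 0), tf.int32)     # (2, 5): 1 = real token
# position (i, j) survives only if both token i and token j are real
mask = keep[:, :, tf.newaxis] * keep[:, tf.newaxis, :]  # (2, 5, 5)
print(mask)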

Please implement scaled_dot_product_attention and you’ll see how the mask is used.
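
In outline, the masked part of that function looks roughly like this (a sketch modeled on the TensorFlow tutorial version; 1 means keep, 0 means mask out):

import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    matmul_qk = tf.matmul(q, k, transpose_b=True)           # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        # padded positions get a large negative logit,
        # so softmax gives them ~0 attention weight
        scaled_attention_logits += (1. - mask) * -1.0e9
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    return tf.matmul(attention_weights, v), attention_weights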

I can see that in scaled_dot_product_attention the mask is:

  1. transformed to replace ones with zeros and zeros with -1.0e9, using (1. - mask) * -1.0e9
  2. added to scaled_attention_logits with shape seq_len_q x seq_len_k

Also, when I run your example through ‘create_padding_mask’ I get:
tf.Tensor(
[[[1. 1. 0. 0. 0.]]

[[1. 1. 1. 0. 0.]]], shape=(2, 1, 5), dtype=float32)
Instead of the 2x5x5 you show above.

And when I run it through ‘x + (1 - create_padding_mask(x)) * -1.0e9’ I get:
tf.Tensor(
[[[ 1.2e+01 1.3e+01 -1.0e+09 -1.0e+09 -1.0e+09]
[ 1.1e+01 1.2e+01 -1.0e+09 -1.0e+09 -1.0e+09]]

[[ 1.2e+01 1.3e+01 0.0e+00 -1.0e+09 -1.0e+09]
[ 1.1e+01 1.2e+01 1.0e+00 -1.0e+09 -1.0e+09]]], shape=(2, 2, 5), dtype=float32)

But I still don’t get why we are adding an additional dimension, or how the addition of two tensors of shapes (2, 1, 5) and (2, 5) works.

The example I provided shows the suggested behavior, not the current notebook behavior.

As far as adding tensors of dissimilar shapes is concerned, please see this link on broadcasting.
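
For example, a quick sketch of the rule (shapes align from the right; missing or size-1 axes are stretched):

import tensorflow as tf

a = tf.zeros((2, 5))     # like the token batch
b = tf.zeros((2, 1, 5))  # like the padding mask

# align from the right: (2, 5) is treated as (1, 2, 5);
# (1, 2, 5) + (2, 1, 5) stretches each size-1 axis, giving (2, 2, 5)
print((a + b).shape)     # (2, 2, 5)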