C5W4 2.1 Padding mask

This is regarding Assignment 1 (Transformer Subclass v1). Can anyone explain why, when taking the softmax of the original sequence and of the masked sequence, the outputs have different shapes, (3, 5) vs. (3, 3, 5)?

Thanks!

Please post a screen capture image that shows the part of the notebook you are asking about.

What is the dimension of the output of create_padding_mask? Looks like maybe you’re getting some “broadcasting” action there. :nerd_face:


Kindly use tf.nn.softmax instead of tf.keras.activations.softmax.


Can you please explain why?
BTW, this code is already in the assignment, not my own.

The reason the sizes are different is shown in the create_padding_mask() function: it inserts an extra axis into the mask, and that's what tf.newaxis does.
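For anyone else landing here, a minimal sketch of what that extra axis does to the shapes. This is an illustration, not the assignment's exact create_padding_mask: the function name, the mask convention (1.0 at padding positions), and the example x are mine.

```python
import tensorflow as tf

def create_padding_mask_sketch(seq):
    # Hypothetical stand-in: 1.0 marks a padding token (id 0),
    # and tf.newaxis inserts the extra dimension.
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)   # (batch_size, seq_len)
    return mask[:, tf.newaxis, :]                        # (batch_size, 1, seq_len)

x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])                  # (3, 5)

mask = create_padding_mask_sketch(x)
print(mask.shape)                                        # (3, 1, 5)

# Softmax of x alone keeps the (3, 5) shape, but adding the (3, 1, 5) mask term
# broadcasts x up to (3, 3, 5) -- hence the two different output shapes.
print(tf.nn.softmax(x).shape)                            # (3, 5)
print(tf.nn.softmax(x + mask * -1e9).shape)              # (3, 3, 5)
```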


Thank you!
My previous question was about Deepti's comment: why should we use tf.nn.softmax instead of tf.keras.activations.softmax?

Plus, tf.keras.activations.softmax is the official code on the site right now.

Thanks,
KJ

Hello @Kejun_Zhang

I meant my comment for def scaled_dot_product_attention, where the softmax is normalized on the last axis (seq_len_k) so that the scores add up to 1.

Regards
DP
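For context, here is a rough sketch of the function that comment refers to, an illustration of scaled dot-product attention rather than the graded code. It assumes a mask that is 1 for real tokens and 0 for padding; the opposite convention just flips the (1. - mask) term.

```python
import tensorflow as tf

def scaled_dot_product_attention_sketch(q, k, v, mask=None):
    # Similarity of every query against every key.
    matmul_qk = tf.matmul(q, k, transpose_b=True)        # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        # Assumed convention: mask is 1 for real tokens, 0 for padding,
        # so padded positions are pushed toward -infinity.
        scaled_logits += (1. - mask) * -1e9

    # The softmax discussed above: normalized over the last axis
    # (seq_len_k), so each query's attention weights sum to 1.
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)
    return tf.matmul(attention_weights, v), attention_weights
```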

I think it’s just an incorrect example…
I feel like adding x (shape (3, 5)) to a mask that’s now expanded to (3, 1, 5) is incorrect.
Presumably, 3 is the batch dim, 5 is the input sequence length, and 1 is the output sequence length (the axis the broadcasting should happen over).
But in the example, that size-1 output-sequence axis ends up being broadcast over the batch size of 3.

I think the example should also be updated to expand dims of x, i.e.:
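(A guess at the snippet meant here, for illustration only; the mask construction is inlined rather than calling the graded function, and the example x is the same toy tensor as above.)

```python
import tensorflow as tf

x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])                              # (3, 5): (batch, input_seq_len)

mask = tf.cast(tf.math.equal(x, 0), tf.float32)[:, tf.newaxis, :]    # (3, 1, 5)

# Give x an explicit output-sequence axis of size 1 before adding the mask,
# so broadcasting happens over that axis instead of over the batch axis.
x_expanded = x[:, tf.newaxis, :]                                     # (3, 1, 5)
print(tf.nn.softmax(x_expanded + mask * -1e9).shape)                 # (3, 1, 5)
```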