When creating a padding mask, we are told that it needs to be broadcast across the sequence rows. However, I do not understand the purpose of this broadcasting.
What I can immediately see is that the first group of values output by the `print(tf.keras.activations.softmax(x + (1 - create_padding_mask(x)) * -1.0e9))` line represents the softmax output when each sequence is masked by its own padding mask. However, the other two groups of values below it appear to be the result of applying one sequence's mask to a different sequence.
The broadcasting is there to make sure the padding positions are ignored on every row of the attention scores, not just one. In self-attention, the scores for a single sequence form a (seq_len_q, seq_len_k) matrix with one row per query position, and every one of those rows attends over the same keys, so the same (1, seq_len_k) padding mask has to be applied to each row; that is what the broadcast does. Applying the mask adds -1.0e9 to the padded positions, which pushes their softmax probabilities to effectively zero.
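Here is a minimal sketch (my own reconstruction, not the lab's exact code) of what that print line is doing, assuming `create_padding_mask` marks real tokens with 1 and padding (token id 0) with 0, which is consistent with the `(1 - mask)` term in your expression:

```python
import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 for real tokens, 0.0 for padding (token id 0); the extra middle
    # axis is what lets the mask broadcast across sequence rows later.
    return tf.cast(tf.math.not_equal(seq, 0), tf.float32)[:, tf.newaxis, :]

# Hypothetical toy batch: three sequences of length 5, zero-padded. In this
# demo the same numbers double as token ids (for the mask) and as logits
# (for the softmax), which is why the example looks a bit artificial.
x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])

mask = create_padding_mask(x)               # shape (3, 1, 5)
masked = x + (1 - mask) * -1.0e9            # broadcasts (3, 5) + (3, 1, 5) -> (3, 3, 5)
print(tf.keras.activations.softmax(masked))  # group i: sequence i's mask on all rows
```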
The other two groups are a side effect of that broadcast in the toy example: x has shape (batch, seq_len) while the mask has shape (batch, 1, seq_len), so `x + (1 - mask) * -1.0e9` broadcasts to (batch, batch, seq_len), and group i shows sequence i's mask applied to every sequence in the batch. Those cross-sequence combinations are not meaningful by themselves; they just preview how a (1, seq_len) mask stretches across multiple rows, exactly as it will across the query rows of an attention score matrix inside the real model.
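To show where the broadcast actually matters, here is a simplified sketch of scaled dot-product attention following the same `(1 - mask) * -1.0e9` convention (names and shapes are illustrative, not the course's exact implementation):

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask):
    # Scores: one row per query position, one column per key position.
    scores = tf.matmul(q, k, transpose_b=True)   # (batch, seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled = scores / tf.math.sqrt(dk)

    # mask: (batch, 1, seq_len_k). The size-1 middle axis broadcasts the
    # same key mask over every query row, so all rows ignore the padding.
    if mask is not None:
        scaled += (1 - mask) * -1.0e9

    weights = tf.nn.softmax(scaled, axis=-1)     # padded keys get ~0 weight
    return tf.matmul(weights, v), weights

# Example: self-attention over a batch of 3 length-5 sequences, where the
# last two positions of each sequence are padding.
emb = tf.random.uniform((3, 5, 4))
pad_mask = tf.constant([[[1., 1., 1., 0., 0.]]] * 3)   # (3, 1, 5)
out, w = scaled_dot_product_attention(emb, emb, emb, pad_mask)
print(w[0])  # in every row, the last two attention weights are ~0
```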
Hope it helps! Feel free to ask if you need further assistance.