When creating a padding mask, we are told that it needs to be broadcast across the sequence rows. However, I do not understand the purpose of this broadcasting.
What I can immediately see is that the first group of values output by the `print(tf.keras.activations.softmax(x + (1 - create_padding_mask(x)) * -1.0e9))` line represents the softmax output when each sequence is masked by its own padding mask. However, the other two groups of values below it appear to be the result of applying one sequence's mask to a different sequence.
The broadcasting is there to make sure the padding positions are ignored on every row of the attention scores, not just one. In self-attention, the scores for a single sequence form a (seq_len_q, seq_len_k) matrix with one row per query position, and every one of those rows attends over the same keys, so the same (1, seq_len_k) padding mask has to be applied to each row; that is what the broadcast does. Applying the mask adds -1.0e9 to the padded positions, which pushes their softmax probabilities to effectively zero.
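Here is a minimal sketch (my own reconstruction, not the lab's exact code) of what that print line is doing, assuming `create_padding_mask` marks real tokens with 1 and padding (token id 0) with 0, which is consistent with the `(1 - mask)` term in your expression:

```python
import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 for real tokens, 0.0 for padding (token id 0); the extra middle
    # axis is what lets the mask broadcast across sequence rows later.
    return tf.cast(tf.math.not_equal(seq, 0), tf.float32)[:, tf.newaxis, :]

# Hypothetical toy batch: three sequences of length 5, zero-padded. In this
# demo the same numbers double as token ids (for the mask) and as logits
# (for the softmax), which is why the example looks a bit artificial.
x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])

mask = create_padding_mask(x)               # shape (3, 1, 5)
masked = x + (1 - mask) * -1.0e9            # broadcasts (3, 5) + (3, 1, 5) -> (3, 3, 5)
print(tf.keras.activations.softmax(masked))  # group i: sequence i's mask on all rows
```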
The other two groups are a side effect of that broadcast in the toy example: x has shape (batch, seq_len) while the mask has shape (batch, 1, seq_len), so `x + (1 - mask) * -1.0e9` broadcasts to (batch, batch, seq_len), and group i shows sequence i's mask applied to every sequence in the batch. Those cross-sequence combinations are not meaningful by themselves; they just preview how a (1, seq_len) mask stretches across multiple rows, exactly as it will across the query rows of an attention score matrix inside the real model.
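To show where the broadcast actually matters, here is a simplified sketch of scaled dot-product attention following the same `(1 - mask) * -1.0e9` convention (names and shapes are illustrative, not the course's exact implementation):

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask):
    # Scores: one row per query position, one column per key position.
    scores = tf.matmul(q, k, transpose_b=True)   # (batch, seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled = scores / tf.math.sqrt(dk)

    # mask: (batch, 1, seq_len_k). The size-1 middle axis broadcasts the
    # same key mask over every query row, so all rows ignore the padding.
    if mask is not None:
        scaled += (1 - mask) * -1.0e9

    weights = tf.nn.softmax(scaled, axis=-1)     # padded keys get ~0 weight
    return tf.matmul(weights, v), weights

# Example: self-attention over a batch of 3 length-5 sequences, where the
# last two positions of each sequence are padding.
emb = tf.random.uniform((3, 5, 4))
pad_mask = tf.constant([[[1., 1., 1., 0., 0.]]] * 3)   # (3, 1, 5)
out, w = scaled_dot_product_attention(emb, emb, emb, pad_mask)
print(w[0])  # in every row, the last two attention weights are ~0
```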
Hope it helps! Feel free to ask if you need further assistance.