Hi, I spent some time looking at forum posts on Friday and even completed the assignment, but would still appreciate help with understanding Section 2 - Masking:
- Why do we add (1 - mask) * -1e9 instead of just applying the mask? Is it to sharpen the distinction between words to attend to (those elements of x are left unchanged) and padded / look-ahead masked words that should be ignored (those elements of x are pushed to effectively -Inf)?
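Here is a small sketch of what I mean, using my own toy numbers and a hand-written mask rather than the assignment's helper:

```python
import tensorflow as tf

# One toy row of attention logits; the last two positions are padding.
logits = tf.constant([[2.0, 1.0, 0.5, 0.0, 0.0]])
mask = tf.constant([[1.0, 1.0, 1.0, 0.0, 0.0]])  # 1 = real token, 0 = padding

# Only multiplying by the mask leaves the padded logits at 0, and since
# exp(0) = 1, softmax still gives the padded positions noticeable weight.
print(tf.keras.activations.softmax(logits * mask))
# ~[[0.54, 0.20, 0.12, 0.07, 0.07]]

# Adding (1 - mask) * -1e9 pushes the padded logits toward -inf, so
# exp(-1e9) ~ 0 and softmax gives those positions ~0 weight.
print(tf.keras.activations.softmax(logits + (1.0 - mask) * -1.0e9))
# ~[[0.63, 0.23, 0.14, 0.0, 0.0]]
```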
Also, in the notebook text it says, “If we multiply (1 - mask) by -1e9 and add it to the sample input sequences, the zeros are essentially set to negative infinity. Notice the difference when taking the softmax of the original sequence and the masked sequence:”
- The difference is that the softmax outputs (effectively) 0 for words that should be ignored? And the following line of code -
print(tf.keras.activations.softmax(x + (1 - create_padding_mask(x)) * -1.0e9))
- is broadcasting x along the second (size-1) dimension of the mask? (The mask is created earlier using create_padding_mask(), which returns seq[:, tf.newaxis, :].)
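To check the shape/broadcasting part, here is a rough sketch of what I think is going on. I'm assuming create_padding_mask returns 1.0 for real tokens and 0.0 for padding (which is what the (1 - mask) usage suggests), and I'm using a toy 3 x 5 batch like the notebook's, where token id 0 is padding:

```python
import tensorflow as tf

def create_padding_mask(seq):
    # My assumption of the helper: 1.0 for real tokens, 0.0 for padding,
    # with an extra axis so the mask can broadcast over attention rows.
    seq = tf.cast(tf.math.not_equal(seq, 0), tf.float32)
    return seq[:, tf.newaxis, :]  # (batch_size, 1, seq_len)

# Toy batch of token ids (0 = padding).
x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])

mask = create_padding_mask(x)
print(x.shape)     # (3, 5)
print(mask.shape)  # (3, 1, 5)

masked = x + (1 - mask) * -1.0e9
print(masked.shape)  # (3, 3, 5)
# Broadcasting right-aligns the shapes: x (3, 5) is treated as (1, 3, 5),
# so x's rows broadcast against the mask's size-1 middle axis, and the
# mask's batch axis broadcasts against x's new leading axis.
# masked[i, j, :] is sequence j combined with sequence i's padding mask.

print(tf.keras.activations.softmax(masked))
# Padded positions come out as ~0 after the softmax.
```

Is that the right way to read the shapes?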
Thanks!