C5_W4_A1_Transformer_Subclass_v1: intuition for (1 - mask)*-1e9

Hi, I spent some time looking at forum posts on Friday and even completed the assignment, but would still appreciate help with understanding Section 2 - Masking:

  1. Why do we add (1 - mask) * -1e9 instead of just the mask? Is it to amplify the difference between words to attend to (those elements of x are left unchanged) and padded or look-ahead-masked words that should be ignored (those elements of x are pushed to -Inf)?

Also, in the notebook text it says, “If we multiply (1 - mask) by -1e9 and add it to the sample input sequences, the zeros are essentially set to negative infinity. Notice the difference when taking the softmax of the original sequence and the masked sequence:”

  1. The difference is that the softmax outputs 0 for words that should be ignored? And the following line of code - print(tf.keras.activations.softmax(x + (1 - create_padding_mask(x)) * -1.0e9)) - is broadcasting x along the second dimension of the mask? (The mask is created earlier using create_padding_mask(), which returns seq[:, tf.newaxis, :].) I've put a small sketch of what I mean below.
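
Here is the sketch, with my own rough version of create_padding_mask (the notebook's exact code may differ, but it also returns seq[:, tf.newaxis, :], i.e. shape (batch, 1, seq_len), with 1 = real token and 0 = padding):

import tensorflow as tf

# My own approximation of a padding mask (1 = real token, 0 = padding);
# the notebook's version may differ slightly, but it also adds a new axis
# so the mask has shape (batch, 1, seq_len).
def create_padding_mask(seq):
    mask = 1 - tf.cast(tf.math.equal(seq, 0), tf.float32)  # (batch, seq_len)
    return mask[:, tf.newaxis, :]                           # (batch, 1, seq_len)

x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])       # shape (3, 5)
mask = create_padding_mask(x)                 # shape (3, 1, 5)

# x with shape (3, 5) is broadcast against the mask's second (new) axis,
# so the masked logits and the softmax output have shape (3, 3, 5).
masked = x + (1 - mask) * -1.0e9
print(masked.shape)                                  # (3, 3, 5)
print(tf.keras.activations.softmax(masked).shape)    # (3, 3, 5)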

Thanks!

If you set the elements you don't care about to 0 before you feed them to softmax, think about what that does. The softmax of a 0 entry is not negligible, because e^0 = 1, right? Depending on the other values it may still be relatively small, but the softmax of -1 * 10^9 is an extremely small number (effectively 0).
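
A quick numeric illustration (plain NumPy, not the notebook's code):

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())      # numerically stable softmax
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.0])   # pretend the last position should be ignored

# A score of 0 still gives that position real weight, because e^0 = 1:
print(softmax(scores))                               # ~[0.665, 0.245, 0.090]

# Adding -1e9 instead drives its weight to (numerically) exactly 0:
print(softmax(scores + np.array([0.0, 0.0, -1e9])))  # ~[0.731, 0.269, 0.   ]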


You can see that even in float64, e^(-1e9) underflows to exactly 0.0:

import numpy as np

x = np.array([-1e9], dtype=np.float64)
print(f"x = {x}")
print(f"np.exp(x) = {np.exp(x)}")

which prints:

x = [-1.e+09]
np.exp(x) = [0.]
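
Tying that back to the line from the notebook: once the masked positions carry -1e9, tf.keras.activations.softmax gives them weight 0. A tiny sketch with my own toy mask, using the same 1 = keep / 0 = padding convention:

import tensorflow as tf

x = tf.constant([[7., 6., 0., 0., 1.]])        # one toy sequence
mask = tf.constant([[1., 1., 0., 0., 1.]])     # toy mask: 1 = keep, 0 = padding

print(tf.keras.activations.softmax(x + (1 - mask) * -1.0e9))
# -> roughly [[0.73, 0.27, 0., 0., 0.0018]]; the padded positions get weight 0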

Thanks! So a score of 0 would give that word a weight of 1 / (sum of all the exponentials), which is still something, when we want absolutely nothing because the word should be ignored. Whereas e^{-\infty} makes the numerator 0, so the weight is 0 as desired (regardless of the denominator of the softmax, since at least one word is not ignored and the denominator stays positive).
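
In symbols, just restating the above (the denominator stays positive because at least one x_j is unmasked):

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}, \qquad e^{0} = 1 \ \text{(still contributes)}, \qquad e^{-10^9} \approx e^{-\infty} = 0 \ \text{(ignored)}.$$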
