C5_W4_A1_Transformer_Subclass_v1: intuition for (1 - mask)*-1e9

Hi, I spent some time looking at forum posts on Friday and even completed the assignment, but would still appreciate help with understanding Section 2 - Masking:

  1. Why do we add (1 - mask) * -1e9 instead of just the mask? Is it to amplify the difference between words to attend to (those elements of x are left unchanged) and padded or look-ahead-masked words that should be ignored (those elements of x are pushed to -Inf)?

Also, in the notebook text it says, “If we multiply (1 - mask) by -1e9 and add it to the sample input sequences, the zeros are essentially set to negative infinity. Notice the difference when taking the softmax of the original sequence and the masked sequence:”

  1. The difference is that the softmax outputs 0 for words that should be ignored? And the following line of code - print(tf.keras.activations.softmax(x + (1 - create_padding_mask(x)) * -1.0e9)) - is broadcasting x along the second dimension of the mask? (The mask is created earlier using create_padding_mask(), which returns seq[:, tf.newaxis, :].) I've put a small sketch of what I mean below.
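
Here is the sketch, with my own rough version of create_padding_mask (the notebook's exact code may differ, but it also returns seq[:, tf.newaxis, :], i.e. shape (batch, 1, seq_len), with 1 = real token and 0 = padding):

import tensorflow as tf

# My own approximation of a padding mask (1 = real token, 0 = padding);
# the notebook's version may differ slightly, but it also adds a new axis
# so the mask has shape (batch, 1, seq_len).
def create_padding_mask(seq):
    mask = 1 - tf.cast(tf.math.equal(seq, 0), tf.float32)  # (batch, seq_len)
    return mask[:, tf.newaxis, :]                           # (batch, 1, seq_len)

x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])       # shape (3, 5)
mask = create_padding_mask(x)                 # shape (3, 1, 5)

# x with shape (3, 5) is broadcast against the mask's second (new) axis,
# so the masked logits and the softmax output have shape (3, 3, 5).
masked = x + (1 - mask) * -1.0e9
print(masked.shape)                                  # (3, 3, 5)
print(tf.keras.activations.softmax(masked).shape)    # (3, 3, 5)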

Thanks!

If you set the elements you don't care about to 0 before you feed them to softmax, think about what that does. The softmax of a 0 entry is not negligible, because e^0 = 1, right? Depending on the other values it may still be relatively small, but the softmax of -1 * 10^9 is an extremely small number (effectively 0).
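
A quick numeric illustration (plain NumPy, not the notebook's code):

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())      # numerically stable softmax
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.0])   # pretend the last position should be ignored

# A score of 0 still gives that position real weight, because e^0 = 1:
print(softmax(scores))                               # ~[0.665, 0.245, 0.090]

# Adding -1e9 instead drives its weight to (numerically) exactly 0:
print(softmax(scores + np.array([0.0, 0.0, -1e9])))  # ~[0.731, 0.269, 0.   ]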


You can see that even in float64, e^(-1e9) underflows to exactly 0.0:

import numpy as np

x = np.array([-1e9], dtype=np.float64)
print(f"x = {x}")
print(f"np.exp(x) = {np.exp(x)}")

which prints:

x = [-1.e+09]
np.exp(x) = [0.]
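
Tying that back to the line from the notebook: once the masked positions carry -1e9, tf.keras.activations.softmax gives them weight 0. A tiny sketch with my own toy mask, using the same 1 = keep / 0 = padding convention:

import tensorflow as tf

x = tf.constant([[7., 6., 0., 0., 1.]])        # one toy sequence
mask = tf.constant([[1., 1., 0., 0., 1.]])     # toy mask: 1 = keep, 0 = padding

print(tf.keras.activations.softmax(x + (1 - mask) * -1.0e9))
# -> roughly [[0.73, 0.27, 0., 0., 0.0018]]; the padded positions get weight 0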

Thanks! So a score of 0 would give that word a weight of 1 / (sum of all the exponentials), which is still something, when we want absolutely nothing because the word should be ignored. Whereas e^{-\infty} makes the numerator 0, so the weight is 0 as desired (regardless of the denominator of the softmax, since at least one word is not ignored and the denominator stays positive).
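
In symbols, just restating the above (the denominator stays positive because at least one x_j is unmasked):

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}, \qquad e^{0} = 1 \ \text{(still contributes)}, \qquad e^{-10^9} \approx e^{-\infty} = 0 \ \text{(ignored)}.$$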
