In the last piece of code in section 2.1 - Padding Mask, there is an example of how the softmax output differs depending on whether the padding mask is applied:
print(tf.keras.activations.softmax(x))
print(tf.keras.activations.softmax(x + (1 - create_padding_mask(x)) * -1.0e9))
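For context, here is a minimal sketch of create_padding_mask that is consistent with the outputs shown below (the exact implementation in the notebook may differ):

import tensorflow as tf

def create_padding_mask(x):
    # 1.0 at real (non-zero) tokens, 0.0 at padding positions
    mask = tf.cast(tf.math.not_equal(x, 0), tf.float32)
    # add a middle axis so the mask can broadcast over attention rows: (batch, 1, seq_len)
    return mask[:, tf.newaxis, :]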
However, the broadcasting doesn’t make sense to me. I modified the code a little to make my point clearer:
print(x)
y = (1 - create_padding_mask(x)) * -100
print(y)
z = x + y
print(z)
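For anyone who wants to run this standalone, x is the example tensor from the notebook; it can be reconstructed from the printed output below:

x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])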
Here is the output:
tf.Tensor(
[[7. 6. 0. 0. 1.]
 [1. 2. 3. 0. 0.]
 [0. 0. 0. 4. 5.]], shape=(3, 5), dtype=float32)

tf.Tensor(
[[[  -0.   -0. -100. -100.   -0.]]

 [[  -0.   -0.   -0. -100. -100.]]

 [[-100. -100. -100.   -0.   -0.]]], shape=(3, 1, 5), dtype=float32)

tf.Tensor(
[[[   7.    6. -100. -100.    1.]
  [   1.    2.  -97. -100.    0.]
  [   0.    0. -100.  -96.    5.]]

 [[   7.    6.    0. -100.  -99.]
  [   1.    2.    3. -100. -100.]
  [   0.    0.    0.  -96.  -95.]]

 [[ -93.  -94. -100.    0.    1.]
  [ -99.  -98.  -97.    0.    0.]
  [-100. -100. -100.    4.    5.]]], shape=(3, 3, 5), dtype=float32)
I don’t really see why it is broadcast to (3, 3, 5) instead of (3, 1, 5).
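As far as I understand, TensorFlow follows NumPy-style broadcasting: shapes are aligned from the right, missing leading dimensions are treated as size 1, and each result dimension takes the larger of the two sizes. That is what produces the extra dimension here:

# x: (3, 5)   is aligned as (1, 3, 5)
# y: (3, 1, 5)
# result:        (3, 3, 5)
print(tf.broadcast_static_shape(x.shape, y.shape))  # (3, 3, 5)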
The only values that make sense to me in ‘z’ are:
- First row of the first (3, 3) block: [ 7. 6. -100. -100. 1.]
- Second row of the second (3, 3) block: [ 1. 2. 3. -100. -100.]
- Third row of the third (3, 3) block: [-100. -100. -100. 4. 5.]
Therefore, the (3,1,5) z would be this:
tf.Tensor(
[[[ 7. 6. -100. -100. 1.]]
[[ 1. 2. 3. -100. -100.]]
 [[-100. -100. -100. 4. 5.]]], shape=(3, 1, 5), dtype=float32)
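If it helps to state it in code, the only way I see to actually get that (3, 1, 5) result is to give x a matching middle axis before the addition (this is my own workaround, not from the notebook):

z_expected = x[:, tf.newaxis, :] + y   # (3, 1, 5) + (3, 1, 5) -> (3, 1, 5)
print(z_expected)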
Am I wrong?