In the last piece of code in section 2.1 - Padding Mask, there is an example of how the softmax output differs depending on whether or not the padding mask is applied.

print(tf.keras.activations.softmax(x))

However, the broadcasting doesn’t make sense to me. I modified the code slightly to make my point clearer:

print(x)
print(y)
z=x+y
print(z)
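For anyone who wants to reproduce this without the course notebook, here is a self-contained sketch; x and y are reconstructed from the printed tensors below (the original section builds y from the padding mask), and NumPy follows the same broadcasting rules as TensorFlow:

```python
import numpy as np

# Values copied from the printed tensors; only the shapes matter here.
x = np.array([[7., 6., 0., 0., 1.],
              [1., 2., 3., 0., 0.],
              [0., 0., 0., 4., 5.]])             # shape (3, 5)
y = np.array([[[0., 0., -100., -100., 0.]],
              [[0., 0., 0., -100., -100.]],
              [[-100., -100., -100., 0., 0.]]])  # shape (3, 1, 5)

# Broadcasting aligns shapes from the right:
#   x: (   3, 5)  -> treated as (1, 3, 5)
#   y: (3, 1, 5)
# Each size-1 axis is stretched, so the sum has shape (3, 3, 5).
z = x + y
print(z.shape)  # (3, 3, 5)
```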

Here is the output:

tf.Tensor(
[[7. 6. 0. 0. 1.]
[1. 2. 3. 0. 0.]
[0. 0. 0. 4. 5.]], shape=(3, 5), dtype=float32)

tf.Tensor(
[[[ -0. -0. -100. -100. -0.]]
[[ -0. -0. -0. -100. -100.]]
[[-100. -100. -100. -0. -0.]]], shape=(3, 1, 5), dtype=float32)

tf.Tensor(
[[[ 7. 6. -100. -100. 1.]
[ 1. 2. -97. -100. 0.]
[ 0. 0. -100. -96. 5.]]

[[ 7. 6. 0. -100. -99.]
[ 1. 2. 3. -100. -100.]
[ 0. 0. 0. -96. -95.]]

[[ -93. -94. -100. 0. 1.]
[ -99. -98. -97. 0. 0.]
[-100. -100. -100. 4. 5.]]], shape=(3, 3, 5), dtype=float32)

I don’t really see why it is broadcast to (3,3,5) instead of (3,1,5).
The only values that make sense to me in ‘z’ are:
-First row of the first (3,3): [ 7. 6. -100. -100. 1.]
-Second row of the second (3,3): [ 1. 2. 3. -100. -100.]
-Third row of the third (3,3): [-100. -100. -100. 4. 5.]

Therefore, the (3,1,5) z would be this:

tf.Tensor(
[[[ 7. 6. -100. -100. 1.]]
[[ 1. 2. 3. -100. -100.]]
[[-100. -100. -100. 4. 5.]]], shape=(3, 1, 5), dtype=float32)
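If the goal is for each row of x to be paired only with its own mask row, the shapes have to match axis for axis so nothing gets stretched. A sketch in NumPy (same broadcasting rules as TensorFlow; this is my workaround, not the course’s code):

```python
import numpy as np

x = np.array([[7., 6., 0., 0., 1.],
              [1., 2., 3., 0., 0.],
              [0., 0., 0., 4., 5.]])             # (3, 5)
y = np.array([[[0., 0., -100., -100., 0.]],
              [[0., 0., 0., -100., -100.]],
              [[-100., -100., -100., 0., 0.]]])  # (3, 1, 5)

# Give x a matching middle axis so no axis is stretched across rows:
z = x[:, None, :] + y   # (3, 1, 5) + (3, 1, 5) -> (3, 1, 5)
print(z.shape)          # (3, 1, 5)
print(z[2, 0])          # [-100. -100. -100.    4.    5.]
```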

Am I wrong?

In other words, instead of this:

print(tf.keras.activations.softmax(x))