Hey,

In the example in the notebook:

x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])

tf.keras.activations.softmax(x + (1 - create_padding_mask(x)) * -1.0e9)

The result is a tensor (let's call it T) of shape (3, 3, 5). Why doesn't this tensor satisfy T[:, 0, :] = T[:, 1, :] = T[:, 2, :]?

Note that they add the new dimension as the second dimension, not the first. That makes the shape of the result (3, 1, 5), as you can see in the output:

```
x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])
print(create_padding_mask(x))
tf.Tensor(
[[[1. 1. 0. 0. 1.]]

 [[1. 1. 1. 0. 0.]]

 [[0. 0. 0. 1. 1.]]], shape=(3, 1, 5), dtype=float32)
```
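For reference, the mask in that output is consistent with marking non-zero entries as 1 and inserting the new axis at position 1. Here is a NumPy sketch of that logic (`create_padding_mask_np` is a hypothetical name; the notebook's actual TensorFlow implementation may differ in detail):

```python
import numpy as np

def create_padding_mask_np(x):
    # 1.0 where the entry is a real (non-zero) value, 0.0 where it is padding
    mask = (x != 0).astype(np.float32)
    # add the new axis as the SECOND dimension: (batch, seq) -> (batch, 1, seq)
    return mask[:, np.newaxis, :]

x = np.array([[7., 6., 0., 0., 1.],
              [1., 2., 3., 0., 0.],
              [0., 0., 0., 4., 5.]])
print(create_padding_mask_np(x).shape)  # (3, 1, 5)
```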

The individual rows are different because the inputs are different, right?

In the code I provided, a tensor of shape (3, 5) was added to a tensor of shape (3, 1, 5). The resulting tensor has shape (3, 3, 5) because of broadcasting. I actually meant to ask why the resulting tensor (let's call it T) did not satisfy T[:, 0, :] = T[:, 1, :]. But now I understand why…
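To see that broadcasting concretely (sketched in NumPy here, which follows the same broadcasting rules as TensorFlow):

```python
import numpy as np

x = np.array([[7., 6., 0., 0., 1.],
              [1., 2., 3., 0., 0.],
              [0., 0., 0., 4., 5.]])                   # shape (3, 5)

mask = (x != 0).astype(np.float64)[:, np.newaxis, :]   # shape (3, 1, 5)

# (3, 5) broadcasts against (3, 1, 5): the (3, 5) operand is treated
# as (1, 3, 5), so the result has shape (3, 3, 5).
T_input = x + (1 - mask) * -1.0e9
print(T_input.shape)  # (3, 3, 5)

# Element-wise: T_input[i, j, :] = x[j, :] + penalty from mask row i,
# so the slices along axis 1 come from DIFFERENT rows of x.
```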

The reason T[:, 0, :] differs from T[:, 1, :] and from T[:, 2, :] is how broadcasting combines the two tensors: x (shape (3, 5)) is treated as (1, 3, 5), so T[i, j, :] is built from row j of x plus the mask penalty from row i. In the tensor x, each row has distinct values, so each row ends up with a distinct pattern before the softmax is applied. The large negative values introduced by (1 - create_padding_mask(x)) * -1.0e9 effectively “zero out” different positions in each row, depending on the mask, but even where the masked positions coincide, the non-masked values in each row of x differ, leading to distinct softmax outputs. Thus, after adding x to the broadcasted mask term and applying softmax, each slice T[:, j, :] along the second dimension is a softmax of a slightly different set of numbers, resulting in different distributions.
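Putting it all together, this can be checked end to end (a NumPy sketch with a hand-rolled softmax, standing in for tf.keras.activations.softmax, which also normalizes over the last axis by default):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([[7., 6., 0., 0., 1.],
              [1., 2., 3., 0., 0.],
              [0., 0., 0., 4., 5.]])
mask = (x != 0).astype(np.float64)[:, np.newaxis, :]

T = softmax(x + (1 - mask) * -1.0e9)   # shape (3, 3, 5)

# Slices along axis 1 are built from different rows of x, so they differ:
print(np.allclose(T[:, 0, :], T[:, 1, :]))  # False
```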
