Sorry for not being clear.
What I meant was that the staff have been notified to update the documentation and the function (if required).
For starters,
x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])
doesn’t make sense to me, since 0 is usually used as the padding token; that means the 0s should appear at the trailing end of each inner array.
The following example is more in line with the transformer architecture. If the integer-encoded and padded batch looks like this for two sentences:
tf.constant([[12, 13, 0, 0, 0],
             [11, 12, 1, 0, 0]])
then the padding mask should be:
<tf.Tensor: shape=(2, 5, 5), dtype=int32, numpy=
array([[[1, 1, 0, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]], dtype=int32)>
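For reference, here is a minimal sketch of how such a mask could be built. create_padding_mask is a hypothetical helper name, and it assumes token id 0 marks padding:

import tensorflow as tf

def create_padding_mask(seq):
    # 1 marks a real token, 0 marks padding (assumed to be token id 0)
    seq_mask = tf.cast(tf.math.not_equal(seq, 0), tf.int32)  # shape (batch, seq_len)
    # outer product per sentence: entry (i, j) is 1 only when
    # both position i and position j hold real tokens
    return seq_mask[:, :, tf.newaxis] * seq_mask[:, tf.newaxis, :]  # (batch, seq_len, seq_len)

batch = tf.constant([[12, 13, 0, 0, 0],
                     [11, 12, 1, 0, 0]])
print(create_padding_mask(batch))  # reproduces the tensor above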
Please implement scaled_dot_product_attention and you’ll see how the mask is used.
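For what it’s worth, here is a minimal sketch of that function under the convention above (mask value 1 = attend, 0 = ignore); the exact signature in the assignment may differ:

import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # raw attention scores, shape (..., seq_len_q, seq_len_k)
    matmul_qk = tf.matmul(q, k, transpose_b=True)
    # scale by sqrt(d_k) so the logits stay in a reasonable range for softmax
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        # masked positions (mask == 0) receive a large negative logit,
        # so their softmax weight is effectively zero
        scaled_logits += (1.0 - tf.cast(mask, tf.float32)) * -1e9
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)
    return tf.matmul(attention_weights, v), attention_weights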