As we can see in the above image, we add an extra dimension to the padding mask like [:, tf.newaxis, :].
When we feed data to the attention block, it looks like this:
input: (batch_size, m, n), where m is the number of tokens and n is the embedding dimension
Q: input * Wq which is of shape (batch, m, dk)
K: input * Wk which is of shape (batch, m, dk)
V: input * Wv which is of shape (batch, m, dv)
Here we assume dk = dv = dm / h
Then we do (Q * K.T) which is of shape (batch, m, m).
And then we do (Q * K.T + M), so the mask should be of shape (batch, m, m).
But as we can see in the above image, we return a mask of shape (n, 1, m). Should we add the new dimension at the front, like (1, n, m), so that it can be broadcast? Also, shouldn't the mask be something like (1, m, m)? Why is there an 'n' in it?
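To make the shapes above concrete, here is a minimal sketch with toy sizes (the numbers and weight names are placeholders, and the mask follows the tutorial-style convention of marking token id 0 as padding and adding a new axis with [:, tf.newaxis, :]):

```python
import tensorflow as tf

batch, m, d_model, h = 2, 5, 16, 4            # toy sizes (hypothetical)
d_k = d_v = d_model // h                      # dk = dv = dm / h, as assumed above

x  = tf.random.normal((batch, m, d_model))    # input: (batch, m, n)
Wq = tf.random.normal((d_model, d_k))
Wk = tf.random.normal((d_model, d_k))
Wv = tf.random.normal((d_model, d_v))

Q = tf.einsum('bmn,nk->bmk', x, Wq)           # (batch, m, dk)
K = tf.einsum('bmn,nk->bmk', x, Wk)           # (batch, m, dk)
V = tf.einsum('bmn,nv->bmv', x, Wv)           # (batch, m, dv)

scores = tf.matmul(Q, K, transpose_b=True)    # (batch, m, m)

# Padding mask: 1.0 where token id == 0, then [:, tf.newaxis, :] -> (batch, 1, m)
seq  = tf.constant([[7, 6, 0, 0, 0], [1, 2, 3, 4, 0]], dtype=tf.float32)
mask = tf.cast(tf.math.equal(seq, 0), tf.float32)[:, tf.newaxis, :]

masked_scores = scores + mask * -1e9          # (batch, 1, m) broadcasts over (batch, m, m)
print(scores.shape, mask.shape, masked_scores.shape)
```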
I'm no expert on keras but I believe this is how the '1' dimension is broadcasted m times to become (batch, m, m). From the docs:
Broadcasting can happen for the missing batch dimensions and the head dimension.
The behavior is not precisely explained, but I assume the 1 would represent the missing head dimension…
You should check the tensorflow implementation to confirm or dispute that since I don't want to go deeper. (Btw, the link "View source on GitHub" is "404 - page not found", as is often the case with TensorFlow.)
I don't think that makes sense. What do you mean by that? What would be the shape after broadcasting?
The mask would be (1, m, m) if the batch size were 1 (for example, for inference). But since we have n sequences, we want the (n, m, m) shape. In other words, n here represents the batch size.
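A quick way to check this broadcast is to compare the (n, 1, m) mask against an explicitly expanded (n, m, m) one (a toy sketch, assuming the usual additive -1e9 masking):

```python
import tensorflow as tf

n, m = 2, 4                                             # n = batch size, m = sequence length
scores = tf.random.normal((n, m, m))                    # Q * K.T for each sequence
mask = tf.constant([[0., 0., 1., 1.],
                    [0., 1., 1., 1.]])[:, tf.newaxis, :]  # (n, 1, m)

# Broadcasting stretches the size-1 axis over the m query positions, so adding
# the (n, 1, m) mask behaves exactly like adding an explicit (n, m, m) mask.
explicit = tf.broadcast_to(mask, (n, m, m))
print(tf.reduce_all(scores + mask * -1e9 == scores + explicit * -1e9).numpy())  # True
```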