C4_W2 ungraded lab masking: regarding the addition of an extra dimension while creating the padding mask


As we can see in the above image, we add an extra dimension to the padding mask with `[:, tf.newaxis, :]`.
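
For reference, a minimal sketch of what such a padding-mask function looks like (the function name, the pad token id 0, and the 1/0 convention are assumptions on my part, not necessarily the exact lab code):

```python
import tensorflow as tf

def create_padding_mask(token_ids, pad_id=0):
    # 1.0 where there is a real token, 0.0 where there is padding
    seq = 1.0 - tf.cast(tf.math.equal(token_ids, pad_id), tf.float32)
    # (batch, seq_len) -> (batch, 1, seq_len): the extra axis from tf.newaxis
    return seq[:, tf.newaxis, :]

x = tf.constant([[7, 6, 0, 0], [1, 2, 3, 0]])
print(create_padding_mask(x).shape)  # (2, 1, 4)
```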

When we feed data to the attention block, the shapes are:
input: (batch_size, m, n), where m is the number of tokens and n is the embedding dimension
Q = input * Wq, which has shape (batch, m, dk)
K = input * Wk, which has shape (batch, m, dk)
V = input * Wv, which has shape (batch, m, dv)
Here we assume dk = dv = d_model / h.
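
As a toy example of those shapes (made-up sizes, using einsum for the projections):

```python
import tensorflow as tf

batch, m, d_model, h = 2, 5, 16, 4
dk = dv = d_model // h                        # assuming dk = dv = d_model / h

x  = tf.random.normal((batch, m, d_model))    # input: (batch, m, n)
Wq = tf.random.normal((d_model, dk))
Wk = tf.random.normal((d_model, dk))
Wv = tf.random.normal((d_model, dv))

Q = tf.einsum('bmn,nd->bmd', x, Wq)           # (batch, m, dk)
K = tf.einsum('bmn,nd->bmd', x, Wk)           # (batch, m, dk)
V = tf.einsum('bmn,nd->bmd', x, Wv)           # (batch, m, dv)
print(Q.shape, K.shape, V.shape)              # (2, 5, 4) (2, 5, 4) (2, 5, 4)
```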

Then we compute Q * K.T, which has shape (batch, m, m).
Then we compute Q * K.T + M, so the mask should have shape (batch, m, m).
But as we can see in the above image, we return a mask of shape
(n, 1, m). Should we add the new dimension in the front instead, like (1, n, m), so that it can be broadcast? Also, shouldn't the mask be (1, m, m)? Why is there an ‘n’ in it?
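
For concreteness, this is the multiplication I mean (toy sizes):

```python
import tensorflow as tf

batch, m, dk = 2, 5, 4
Q = tf.random.normal((batch, m, dk))
K = tf.random.normal((batch, m, dk))

scores = tf.matmul(Q, K, transpose_b=True)
print(scores.shape)  # (2, 5, 5), i.e. (batch, m, m) -- the mask M added to
                     # this has to be broadcastable against that shape
```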

Hi @God_of_Calamity

Yes, that is correct.

I’m no expert on Keras, but I believe this is how the “1” dimension is broadcast m times to become (batch, m, m). From the docs:

Broadcasting can happen for the missing batch dimensions and the head dimension.

The behavior is not precisely explained, but I assume the 1 would represent the missing head dimension…
You should check the TensorFlow implementation to confirm or dispute that, since I don’t want to go deeper. (Btw, the link “View source on GitHub” returns “404 - page not found”, as is often the case with TensorFlow :roll_eyes:)

I don’t think that makes sense. What do you mean by that? What would be the shape after broadcasting?

The mask would be (1, m, m) if the batch size were 1 (for example, for inference). But since we have n sequences, we want the (n, m, m) shape. In other words, n here represents the batch size.
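
A quick toy check of that broadcasting (the -1e9 trick is just the usual way the mask gets applied to the logits, not necessarily verbatim lab code):

```python
import tensorflow as tf

n, m = 2, 5                              # n = batch size, m = sequence length
scores = tf.random.normal((n, m, m))     # Q @ K.T
mask   = tf.ones((n, 1, m))              # shape returned by the padding-mask function

masked = scores + (1.0 - mask) * -1e9    # the size-1 axis broadcasts over the m rows
print(masked.shape)                      # (2, 5, 5), i.e. (n, m, m)
```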

Cheers
