As we can see in the above image, we add an extra dimension to the padding mask like [:, tf.newaxis, :].
When we feed data to the attention block, it looks like this:
input: (batch_size, m, n), where m is the number of tokens and n is the embedding dimension
Q: input * Wq which is of shape (batch, m, dk)
K: input * Wk which is of shape (batch, m, dk)
V: input * Wv which is of shape (batch, m, dv)
Here we assume dk = dv = dm / h
Then we do (Q * K.T) which is of shape (batch, m, m).
And then we do (Q * K.T + M), so the mask should be of shape (batch, m, m).
But as we can see in the above image, we return a mask of shape (n, 1, m). Should we add the new dimension at the front, like (1, n, m), so that it can be broadcast? Also, shouldn't the mask be something like (1, m, m)? Why is there an 'n' in it?
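To make the shapes above concrete, here is a minimal sketch with toy sizes (the numbers and weight names are placeholders, and the mask follows the tutorial-style convention of marking token id 0 as padding and adding a new axis with [:, tf.newaxis, :]):

```python
import tensorflow as tf

batch, m, d_model, h = 2, 5, 16, 4            # toy sizes (hypothetical)
d_k = d_v = d_model // h                      # dk = dv = dm / h, as assumed above

x  = tf.random.normal((batch, m, d_model))    # input: (batch, m, n)
Wq = tf.random.normal((d_model, d_k))
Wk = tf.random.normal((d_model, d_k))
Wv = tf.random.normal((d_model, d_v))

Q = tf.einsum('bmn,nk->bmk', x, Wq)           # (batch, m, dk)
K = tf.einsum('bmn,nk->bmk', x, Wk)           # (batch, m, dk)
V = tf.einsum('bmn,nv->bmv', x, Wv)           # (batch, m, dv)

scores = tf.matmul(Q, K, transpose_b=True)    # (batch, m, m)

# Padding mask: 1.0 where token id == 0, then [:, tf.newaxis, :] -> (batch, 1, m)
seq  = tf.constant([[7, 6, 0, 0, 0], [1, 2, 3, 4, 0]], dtype=tf.float32)
mask = tf.cast(tf.math.equal(seq, 0), tf.float32)[:, tf.newaxis, :]

masked_scores = scores + mask * -1e9          # (batch, 1, m) broadcasts over (batch, m, m)
print(scores.shape, mask.shape, masked_scores.shape)
```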
I'm no expert on keras but I believe this is how the '1' dimension is broadcasted m times to become (batch, m, m). From the docs:
Broadcasting can happen for the missing batch dimensions and the head dimension.
The behavior is not precisely explained, but I assume the 1 would represent the missing head dimension…
You should check the tensorflow implementation to confirm or dispute that since I don't want to go deeper. (Btw, the link "View source on GitHub" is "404 - page not found", as is often the case with TensorFlow.)
I don't think that makes sense. What do you mean by that? What would be the shape after broadcasting?
The mask would be (1, m, m) if the batch size were 1 (for example, for inference). But since we have n sequences, we want the (n, m, m) shape. In other words, n here represents the batch size.
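A quick way to check this broadcast is to compare the (n, 1, m) mask against an explicitly expanded (n, m, m) one (a toy sketch, assuming the usual additive -1e9 masking):

```python
import tensorflow as tf

n, m = 2, 4                                             # n = batch size, m = sequence length
scores = tf.random.normal((n, m, m))                    # Q * K.T for each sequence
mask = tf.constant([[0., 0., 1., 1.],
                    [0., 1., 1., 1.]])[:, tf.newaxis, :]  # (n, 1, m)

# Broadcasting stretches the size-1 axis over the m query positions, so adding
# the (n, 1, m) mask behaves exactly like adding an explicit (n, m, m) mask.
explicit = tf.broadcast_to(mask, (n, m, m))
print(tf.reduce_all(scores + mask * -1e9 == scores + explicit * -1e9).numpy())  # True
```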