Masking: np.where, np.full_like

Hello, I am having trouble understanding this line of code for masking:
np.where(m, dots, np.full_like(dots, -1e9)). I would appreciate some help.

I played with the example below from ungraded lab 1:
q = create_tensor([[1, 0, 0], [0, 1, 0]])
display_tensor(q, 'query')
k = create_tensor([[1, 2, 3], [4, 5, 6]])
display_tensor(k, 'key')
m = create_tensor([[0, 0], [-1e9, 0]])
display_tensor(m, ‘mask’)

if m is not None:
    dots = np.where(m, dots, np.full_like(dots, -1e9))

So this is my m:
[[ 0.e+00  0.e+00]
 [-1.e+09  0.e+00]]

This is my dots before the mask:
array([[0.57735027, 2.30940108],
       [1.15470054, 2.88675135]])

This is my dots after the mask:
array([[-1.00000000e+09, -1.00000000e+09],
       [ 1.15470054e+00, -1.00000000e+09]])

So I am guessing this line of code, np.where(m, dots, np.full_like(dots, -1e9)),
is saying: where m is 0, we replace the value with a large negative number (-1e9), and where m is not 0 (that's below the diagonal) we keep it. Am I right?
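Here is how I checked that reading in plain NumPy (not the lab code, just the same values, with np.array standing in for create_tensor):

import numpy as np

m = np.array([[0.0, 0.0], [-1e9, 0.0]])      # the mask built in the lab
dots = np.array([[0.57735027, 2.30940108],
                 [1.15470054, 2.88675135]])  # dots before the mask

# np.full_like(dots, -1e9): an array with the same shape and dtype as dots, filled with -1e9
# np.where(m, dots, filler): keep dots where m is non-zero (truthy), take the filler where m is 0
masked = np.where(m, dots, np.full_like(dots, -1e9))
print(masked)
# [[-1.00000000e+09 -1.00000000e+09]
#  [ 1.15470054e+00 -1.00000000e+09]]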

But I thought we were supposed to mask with a matrix like this:
image
So shouldn't the mask m be
[[0, -inf]
 [0,    0]]
instead of the one from the lab,
[[ 0.e+00  0.e+00]
 [-1.e+09  0.e+00]]?
If I follow the picture, my line of code should be something like:
dots = dots + [[0, -inf],
               [0,    0]]

Hi @Fei_Li

Good job spotting this mistake :+1:

Yes, the m variable created in the lab is wrong (at least, it does not reflect the picture you mentioned or the usual convention; in theory you could make it work if you transpose it or transpose the dots, but that is not what usually happens).

I will submit it for fixing so that future learners will not get confused by it, thanks to you :slight_smile:

Sure. Thank you. I was trying to learn though.

So you mean applying a transpose to the mask, like this?
m = create_tensor([[0, 0], [-1e9, 0]]).T
Then my mask would be:
image

Then I use this line of code,
if m is not None:
    dots = np.where(m, dots, np.full_like(dots, -1e9))

to apply it to my dots, which is:
image

I get
image

I thought the numbers above the diagonal should be 0, but now it is the ones below. Also, the numbers on the diagonal should not be zeros.

I still don't understand np.full_like and np.where(m, dots, …), would you help me with that as well?

Appreciate your help.

Yes, that is the way the mask is supposed to look according to the pictures (when we use the mask as an addition to the dots).

So the continuation should be:
image
and when we apply the mask as in the picture, we get:
image
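To make that addition step concrete, here is a minimal sketch with the same numbers (plain NumPy, not the Lab's code; -1e9 stands in for -inf):

import numpy as np

dots = np.array([[0.57735027, 2.30940108],
                 [1.15470054, 2.88675135]])
m = np.array([[0.0, -1e9],
              [0.0,  0.0]])   # additive mask: a huge negative number above the diagonal

masked = dots + m
# row 0: [ 0.5773...,  ~-1e9    ]  <- the "future" position is pushed to a huge negative value
# row 1: [ 1.1547...,  2.8867...]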

Below is some additional input for your learning (this is not in the Lab):


To get the attention scores we would apply softmax:
image

These would be the attention weights. How to interpret them:

  • for the first token (word) we would assign all the attention to the first token and ignore the second; (or concretely: 1.0 * first_token_embeddings + 0.0 * second_token_embeddings)
  • for the second token we would assign 85% of attention to the second and 15% attention to the first; (or concretely: 0.15 * first_token_embeddings + 0.85 * second_token_embeddings)
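If you want to reproduce those two bullets, here is a sketch of the softmax step in plain NumPy (my own illustration, not the Lab's code, starting from the masked dots above with the masked entry rounded to -1e9):

import numpy as np

masked = np.array([[0.57735027, -1e9],
                   [1.15470054, 2.88675135]])  # dots after the additive mask

# row-wise softmax: shift by the row max for numerical stability, exponentiate, normalize
exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# [[1.   0.  ]
#  [0.15 0.85]]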

Unfortunately (or fortunately) this Lab also has another implementation of DotProductAttention with a different approach, which might confuse you. In the second approach, the mask is generated with 1s and is consequently Boolean. Here is the second implementation (not the one in the pictures, but similar nonetheless):
image
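In case the np.where / np.full_like pair is still unclear, here is a minimal sketch of that Boolean-mask style (my own illustration using a lower-triangular mask, not the Lab's exact code):

import numpy as np

dots = np.array([[0.57735027, 2.30940108],
                 [1.15470054, 2.88675135]])

# Boolean-style causal mask: 1 on and below the diagonal (positions a token may attend to)
mask = np.tril(np.ones_like(dots))
# [[1. 0.]
#  [1. 1.]]

# keep dots where the mask is truthy; elsewhere take the -1e9 filler built by np.full_like
masked = np.where(mask, dots, np.full_like(dots, -1e9))
# [[ 0.5773... -1e+09   ]
#  [ 1.1547...  2.8867...]]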

Ah, sure, sure. Now it's so much clearer. I compared your code with mine and realized I was using "dots + mask" and then applying "np.where(mask, dots, …)" again…

Also, thanks for the additional explanation of the weights. Now I understand "attention" better.

I didn't understand this line:

np.swapaxes(key, -1, -2)

Why should we transpose only the last two axes?

and

logsumexp = scipy.special.logsumexp(dots, axis=-1, keepdims=True)

How was I supposed to arrive at axis=-1, keepdims=True from the description?
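For reference, here is a quick shape check of both calls (plain NumPy/SciPy with made-up shapes, just to see what they do):

import numpy as np
from scipy.special import logsumexp

# a key tensor with a made-up batched shape: (batch, seq_len, d_head)
key = np.ones((8, 5, 64))

# swapping only the last two axes leaves the batch axis alone,
# which is why -1, -2 is used instead of a full transpose
print(np.swapaxes(key, -1, -2).shape)                  # (8, 64, 5)

# logsumexp over the last axis with keepdims=True keeps that axis as size 1,
# so the result still broadcasts when subtracted from dots
dots = np.ones((8, 5, 5))
print(logsumexp(dots, axis=-1, keepdims=True).shape)   # (8, 5, 1)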