Hello, I am having trouble understanding this line of code for masking:
`np.where(m, dots, np.full_like(dots, -1e9))`. I would appreciate some help.

I played with the example below from ungraded lab 1:
q = create_tensor([[1, 0, 0], [0, 1, 0]])
display_tensor(q, 'query')
k = create_tensor([[1, 2, 3], [4, 5, 6]])
display_tensor(k, 'key')
m = create_tensor([[0, 0], [-1e9, 0]])

```python
if m is not None:
    dots = np.where(m, dots, np.full_like(dots, -1e9))
```

so this is my m:
[[ 0.e+00 0.e+00]
[-1.e+09 0.e+00]]

this is my dots before mask:
array([[0.57735027, 2.30940108],
[1.15470054, 2.88675135]])

this is my dots after mask:
array([[-1.00000000e+09, -1.00000000e+09],
[ 1.15470054e+00, -1.00000000e+09]])

So I am guessing this line of code, `np.where(m, dots, np.full_like(dots, -1e9))`,
is saying: where m is 0, we replace with a large negative number (-1e9); where m is not 0 (that's below the diagonal), we keep it. Am I right?
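To check my reading, here is a quick sketch with the values from above (`np.where` treats 0 as False and any nonzero value as True):

```python
import numpy as np

m = np.array([[0.0, 0.0], [-1e9, 0.0]])
dots = np.array([[0.57735027, 2.30940108],
                 [1.15470054, 2.88675135]])

# np.full_like(dots, -1e9) builds an array of dots' shape filled with -1e9;
# np.where(m, dots, fill) keeps dots where m is nonzero, else takes the fill
masked = np.where(m, dots, np.full_like(dots, -1e9))
# only m[1, 0] is nonzero, so only dots[1, 0] survives
```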

but I thought we were to mask with a matrix like this

so shouldn’t the mask m be like
[[0, inf]
[0, 0]]
instead of
[[ 0.e+00 0.e+00]
[-1.e+09 0.e+00]] from the lab?
if I follow the picture, my line of code should be like:

```python
dots = dots + [[0, inf],
               [0, 0]]
```

Hi @Fei_Li

Good job spotting this mistake!

Yes, the `m` variable created in the lab is wrong (at least it does not reflect the picture you mentioned or the usual overall use; in theory you could make it work if you transposed it or transposed the dots, but that is not what usually happens).

I will submit it for fixing so that future learners don't get confused by it. Thanks to you!

Sure. Thank you. I was trying to learn though.

So you mean applying a transpose on a mask like this?
m = create_tensor([[0, 0], [-1e9, 0]]).T

then I use this line of code
```python
if m is not None:
    dots = np.where(m, dots, np.full_like(dots, -1e9))
```

to apply it to my dots, which is

I get

I thought the numbers above the diagonal should be 0, but now the ones below are. Also, the numbers on the diagonal should not be zeros.

I still don’t understand `np.full_like` and `np.where(m, dots, …)`; would you help me with those as well?

Yes, that is the way the mask is supposed to look according to the pictures (when we use the mask as an addition to the dots).

So the continuation should be:

and when we apply mask as in the picture, we get :

Below is my additional input for your learning (This is not in the Lab):

To get the attention scores we would apply softmax:

These would be the attention weights. How to interpret them:

• for the first token (word) we would assign all the attention to the first token and ignore the second; (or concretely: `1.0 * first_token_embeddings + 0.0 * second_token_embeddings`)
• for the second token we would assign 85% of attention to the second and 15% attention to the first; (or concretely: `0.15 * first_token_embeddings + 0.85 * second_token_embeddings`)
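To make those numbers concrete, here is a small sketch (not from the Lab; the masked dots are the values shown earlier in the thread, and `softmax` is a plain NumPy implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability before exponentiating
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

# dots after masking: the above-diagonal entry is pushed to -1e9
masked_dots = np.array([[0.57735027, -1e9],
                        [1.15470054, 2.88675135]])

weights = softmax(masked_dots)
# row 0 -> [1.0, 0.0]; row 1 -> roughly [0.15, 0.85]
```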

Unfortunately (or fortunately) this Lab also has another implementation of DotProductAttention with a different approach, which might confuse you. In the second approach, the mask is generated with 1s and is consequently Boolean. The second implementation (not the one in the pictures, but similar nonetheless):
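A minimal sketch of that second style (I build the lower-triangular mask of 1s with `np.tril` here for illustration; the lab may construct it differently):

```python
import numpy as np

dots = np.array([[0.57735027, 2.30940108],
                 [1.15470054, 2.88675135]])

# Boolean-style mask: 1 (truthy) at and below the diagonal, 0 (falsy) above
mask = np.tril(np.ones_like(dots))

# keep dots where the mask is truthy, replace the rest with -1e9
masked = np.where(mask, dots, np.full_like(dots, -1e9))
# -> [[ 0.577..., -1e9     ],
#     [ 1.154...,  2.886...]]
```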

Ah, sure sure. Now it’s so much clearer. I compared your code with mine. I realized I was using `dots + mask` and then applying `np.where(mask, dots, …)` again…

Also, thanks for the additional explanation of the weights. Now I understand “attention” better.

I didn’t understand this line:

```python
np.swapaxes(key, -1, -2)
```

why do we transpose only the last two axes?
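Playing with shapes, this is how far I got (a toy tensor of my own, not from the lab):

```python
import numpy as np

# With batched inputs, key has shape (batch, seq_len, d_model)
key = np.arange(24).reshape(2, 3, 4)

# .T reverses ALL axes, which would scramble the batch dimension
print(key.T.shape)                     # (4, 3, 2)

# swapaxes(-1, -2) transposes only each (seq_len, d_model) matrix
print(np.swapaxes(key, -1, -2).shape)  # (2, 4, 3)
```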

and

```python
logsumexp = scipy.special.logsumexp(dots, axis=-1, keepdims=True)
```

How would I have arrived at `axis=-1, keepdims=True` from the description?
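Here is my attempt at unpacking it (the dots values are reused from earlier in the thread):

```python
import numpy as np
from scipy.special import logsumexp

dots = np.array([[0.57735027, -1e9],
                 [1.15470054, 2.88675135]])

# axis=-1: reduce over the last axis, i.e. one value per query row
# keepdims=True: result has shape (2, 1) instead of (2,), so it
# broadcasts against dots in the subtraction below
lse = logsumexp(dots, axis=-1, keepdims=True)

log_weights = dots - lse  # log-softmax; exponentiating gives the weights
```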