W4 A1 UNQ_C3 assignment question for the Sequence Models class

I am a little confused about the dimensions in the Q3 problem. In the comments, the q, k, v dimensions are given as (…, a, b), where … is the batch dimension. But when I run my code and print the shapes of q, k, v, I get 2-D matrices instead.
On the other hand, the mask has a 3-D shape of (1, a, b). When I do the straightforward addition scaled_attention_logits += (1. - mask) * 1.0e-9, I get an error because the 3-D mask tensor does not add to the 2-D original matrix. If I do np.squeeze(mask), I get a wrong-weights error message. Can you please help me figure out the problem? Thank you.


I assume you mean the DLS C5 W4 A1 assignment and the function UNQ_C3, which is scaled_dot_product_attention. I added a bunch of print statements to my code, and here’s what I see when I run the test cell:

q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk =
[[2. 3. 1. 1.]
 [2. 2. 2. 1.]
 [2. 2. 0. 1.]]
dk 4.0
attention_weights.shape (3, 4)
attention_weights =
[[0.2589478  0.42693272 0.15705977 0.15705977]
 [0.2772748  0.2772748  0.2772748  0.16817567]
 [0.33620113 0.33620113 0.12368149 0.2039163 ]]
sum(attention_weights(axis = -1)) =
[[1.0000001]
 [1.       ]
 [1.       ]]
output.shape (3, 2)
output =
[[0.74105227 0.15705977]
 [0.7227253  0.16817567]
 [0.6637989  0.2039163 ]]
q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk =
[[2. 3. 1. 1.]
 [2. 2. 2. 1.]
 [2. 2. 0. 1.]]
dk 4.0
mask.shape (1, 3, 4)
applying mask =
[[[1 1 0 1]
  [1 1 0 1]
  [1 1 0 1]]]
attention_weights.shape (1, 3, 4)
attention_weights =
[[[0.3071959  0.5064804  0.         0.18632373]
  [0.38365173 0.38365173 0.         0.23269653]
  [0.38365173 0.38365173 0.         0.23269653]]]
sum(attention_weights(axis = -1)) =
[[[1.]
  [1.]
  [1.]]]
output.shape (1, 3, 2)
output =
[[[0.6928041  0.18632373]
  [0.61634827 0.23269653]
  [0.61634827 0.23269653]]]
All tests passed

You’ll notice that the mask is not specified in all the tests, but when it is, the += works fine via broadcasting. So now the question is why that didn’t work for you. It shouldn’t take any extra reshaping effort to get that to work. Notice also that the resulting shapes of attention_weights and output differ depending on whether the mask is specified or not.
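If it helps to see the broadcasting behavior in isolation, here is a small self-contained sketch (my own illustration, not part of the assignment; it just uses ones with the same shapes as the test case). The point is that a TF EagerTensor is immutable, so += rebinds the name to a new broadcast result, whereas NumPy’s in-place += cannot store a (1, 3, 4) result back into a (3, 4) array:

import numpy as np
import tensorflow as tf

logits = tf.ones((3, 4))     # like scaled_attention_logits with no batch dimension
mask = tf.ones((1, 3, 4))    # like the (1, seq_len_q, seq_len_k) mask in the test

logits += (1. - mask)        # EagerTensor += rebinds the name to a new
print(logits.shape)          # broadcast tensor of shape (1, 3, 4)

np_logits = np.ones((3, 4))
try:
    np_logits += (1. - np.ones((1, 3, 4)))   # in-place += cannot store a (1, 3, 4)
except ValueError as err:                    # result back into a (3, 4) array
    print("NumPy error:", err)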


One thing to check is to make sure that you’re doing everything with TF ops here. One danger with “+=” is that it’s overloaded, right? Meaning that it will figure out which functions to actually call based on the types of the arguments. Although broadcasting should work in either case, one would think. But if I had to come up with a theory for what is going wrong, it would be some kind of type issue. I added yet more print statements, and here’s what I see for the type of scaled_attention_logits:

type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)

Hi Paul,
thanks for your reply. Actually, I am done with the whole assignment except for this C3 problem.
I think my attention_weights formula is incorrect when it operates with the mask (without the mask I am OK, compared with your prints). I am doing what the instructions say:

# add the mask to the scaled tensor.
if mask is not None: # Don't replace this None
    scaled_attention_logits = scaled_attention_logits[np.newaxis,:,:] + (1. - mask) * 1.0e-9

I have to add a new axis to make it a 3-D tensor in the mask case, but with the mask my attention weights are different from the correct ones (though the shape is OK).

I can’t explain it based on the evidence you’ve given us so far, but I literally did not have to do that. I just used

scaled_attention_logits += <mask formula here>

and it worked fine with the dimensions as documented in my earlier reply. So I’m actually interested to try to explain this. Did you try my suggestion of checking the type of scaled_attention_logits?

Are you saying that you fail the unit tests for UNQ_C3 with your solution as shown there? I will try your version and see what happens for me.

As expected, your method works for me. It’s logically equivalent; you just make the broadcasting operation explicit.

So why is it different in your case? There must be something in your earlier code that is wrong. Maybe time for the DM thread to share code. I’ll send you a DM.

Hi Paul,
I know why the broadcasting was different. I was converting scaled_attention_logits to a TF tensor inside the softmax expression (after doing all the + mask algebra). Now I do the conversion right where it is first calculated, before the + mask algebra, and it broadcasts automatically to the mask tensor dimensions, so I can use += f(mask), as you do and as the code template was directing. But my values are still wrong. This is the formula I use based on the instructions, and my weights are different from yours when the mask is applied (with no mask, I am correct):
scaled_attention_logits += (1. - mask) * 1.0e-9

You lost me there: why do you have to convert the type of scaled_attention_logits? You get it by doing a tf.matmul of two TF tensors followed by dividing that result by a scalar tensor, right? So it should already be a TF EagerTensor.

In other words the problem here has nothing to do with mask. The values are already wrong before you get there. This should not be complicated: look again at the formulas that you are implementing here.

Here is the formula:

\displaystyle \frac {QK^T}{\sqrt{d_k}}

The operation in the numerator there is a dot product, right? As in tf.matmul. That is Prof Ng’s notational convention and it’s been that way from the “get go”. If he meant elementwise, he would have used “*” explicitly.

I would guess there is an error in one of these lines:

    matmul_qk = None  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk
    dk = None

Thank you, guys. Once again, I did not read the instructions carefully. I used np.matmul and not tf.matmul. That was the first error. Then I multiplied by 1e-9 instead of -1e9 (as is necessary for the softmax!). That was another error.
Now everything is fixed.
Thanks again!
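For anyone who hits the same issues later, here is a minimal sketch of the corrected flow we converged on in this thread (tf.matmul for QK^T, dk taken from the key depth, and -1e9 in the mask term). It is a summary under those assumptions, not the graded notebook code verbatim:

import tensorflow as tf

def scaled_dot_product_attention_sketch(q, k, v, mask=None):
    # Q K^T as a true matrix product (tf.matmul), not an elementwise product
    matmul_qk = tf.matmul(q, k, transpose_b=True)        # (..., seq_len_q, seq_len_k)

    # scale by sqrt(dk), where dk is the depth (last dimension) of the keys
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add a large negative number (-1e9, not 1e-9) at masked positions so that
    # softmax drives those attention weights to essentially zero
    if mask is not None:
        scaled_attention_logits += (1. - mask) * -1e9

    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # rows sum to 1
    output = tf.matmul(attention_weights, v)             # (..., seq_len_q, depth_v)
    return output, attention_weights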

Glad to hear you found the issues. Yet again we learn the lesson that “saving time” by not reading the instructions diligently is frequently not a net savings of time. :nerd_face:

Also note that I gave you the suggestion of checking the type of scaled_attention_logits four hours ago.

Yes, but I was not continuously working on this problem; I had a break :slight_smile:
Anyway, thanks a lot!

I am done with the Deep Learning Specialization.
I am now enrolling in the TensorFlow Developer Professional Specialization. First, improve my TF skills, and then move on to more NLP! :muscle:

That’s great! Congratulations on getting through DLS pretty quickly. There is no shortage of other things to learn. Onward! :nerd_face:

In this exercise, what is the right way to get dk?

Use tf.cast and tf.shape from TensorFlow.

I mean, the shape of which matrix, by definition?

The depth of the key matrix (its last dimension).
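In code, that usually looks something like the line below (a hedged sketch, assuming k has shape (..., seq_len_k, depth)):

dk = tf.cast(tf.shape(k)[-1], tf.float32)   # depth = size of the last axis of k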