W4 A1 UNQ_C3 assignment question for the Sequence Models class

I am a little confused about the dimensions in the Q3 problem. In the comments, the q, k, v dimensions are given as (…, a, b), where … is the batch dimension. But when I run my code and print the shapes of q, k, v, I get 2-D matrices instead.
On the other hand, the mask has a 3-D shape of (1, a, b). When I do the straightforward addition scaled_attention_logits += (1. - mask) * 1.0e-9, I get an error because the 3-D mask tensor does not add to the 2-D original matrix. If I do np.squeeze(mask), I get a wrong-weights error message. Can you please help me figure out the problem? Thank you.


I assume you mean the DLS C5 W4 A1 assignment and the function UNQ_C3, which is scaled_dot_product_attention. I added a bunch of print statements to my code, and here’s what I see when I run the test cell:

q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk =
[[2. 3. 1. 1.]
 [2. 2. 2. 1.]
 [2. 2. 0. 1.]]
dk 4.0
attention_weights.shape (3, 4)
attention_weights =
[[0.2589478  0.42693272 0.15705977 0.15705977]
 [0.2772748  0.2772748  0.2772748  0.16817567]
 [0.33620113 0.33620113 0.12368149 0.2039163 ]]
sum(attention_weights(axis = -1)) =
[[1.0000001]
 [1.       ]
 [1.       ]]
output.shape (3, 2)
output =
[[0.74105227 0.15705977]
 [0.7227253  0.16817567]
 [0.6637989  0.2039163 ]]
q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk =
[[2. 3. 1. 1.]
 [2. 2. 2. 1.]
 [2. 2. 0. 1.]]
dk 4.0
mask.shape (1, 3, 4)
applying mask =
[[[1 1 0 1]
  [1 1 0 1]
  [1 1 0 1]]]
attention_weights.shape (1, 3, 4)
attention_weights =
[[[0.3071959  0.5064804  0.         0.18632373]
  [0.38365173 0.38365173 0.         0.23269653]
  [0.38365173 0.38365173 0.         0.23269653]]]
sum(attention_weights(axis = -1)) =
[[[1.]
  [1.]
  [1.]]]
output.shape (1, 3, 2)
output =
[[[0.6928041  0.18632373]
  [0.61634827 0.23269653]
  [0.61634827 0.23269653]]]
All tests passed

You’ll notice that the mask is not specified in all the tests, but when it is, the += works fine via broadcasting. So now the question is why that didn’t work for you. It shouldn’t take any extra reshaping effort to get that to work. Notice also that the resulting shapes of attention_weights and output differ depending on whether the mask is specified or not.
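If it helps to see the broadcasting behavior in isolation, here is a small self-contained sketch (my own illustration, not part of the assignment; it just uses ones with the same shapes as the test case). The point is that a TF EagerTensor is immutable, so += rebinds the name to a new broadcast result, whereas NumPy’s in-place += cannot store a (1, 3, 4) result back into a (3, 4) array:

import numpy as np
import tensorflow as tf

logits = tf.ones((3, 4))     # like scaled_attention_logits with no batch dimension
mask = tf.ones((1, 3, 4))    # like the (1, seq_len_q, seq_len_k) mask in the test

logits += (1. - mask)        # EagerTensor += rebinds the name to a new
print(logits.shape)          # broadcast tensor of shape (1, 3, 4)

np_logits = np.ones((3, 4))
try:
    np_logits += (1. - np.ones((1, 3, 4)))   # in-place += cannot store a (1, 3, 4)
except ValueError as err:                    # result back into a (3, 4) array
    print("NumPy error:", err)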


One thing to check is to make sure that you’re doing everything with TF ops here. One danger with “+=” is that it’s overloaded, right? Meaning that it will figure out which functions to actually call based on the types of the arguments. Although broadcasting should work in either case, one would think. But if I had to come up with a theory for what is going wrong, it would be some kind of type issue. I added yet more print statements, and here’s what I see for the type of scaled_attention_logits:

type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)

Hi Paul,
thanks for your reply. Actually, I am done with the whole assignment except for this C3 problem.
I think my attention_weights formula is incorrect when it operates with the mask (without the mask I am OK, compared with your prints). I am doing what the instructions say:

# add the mask to the scaled tensor.
if mask is not None: # Don't replace this None
    scaled_attention_logits = scaled_attention_logits[np.newaxis,:,:] + (1. - mask) * 1.0e-9

I have to add a new axis to make it a 3-D tensor in the mask case, but with the mask my attention weights are different from the correct ones (though the shape is OK).

I can’t explain it based on the evidence you’ve given us so far, but I literally did not have to do that. I just used

scaled_attention_logits += <mask formula here>

and it worked fine with the dimensions as documented in my earlier reply. So I’m actually interested to try to explain this. Did you try my suggestion of checking the type of scaled_attention_logits?

Are you saying that you fail the unit tests for UNQ_C3 with your solution as shown there? I will try your version and see what happens for me.

As expected, your method works for me. It’s logically equivalent; you just make the broadcasting operation explicit.

So why is it different in your case? There must be something in your earlier code that is wrong. Maybe time for the DM thread to share code. I’ll send you a DM.

Hi Paul,
I know why the broadcasting was different. I was converting scaled_attention_logits to a TF tensor inside the softmax expression (after doing all the + mask algebra). Now I do the conversion right where it is first calculated, before the + mask algebra, and it broadcasts automatically to the mask tensor dimensions, so I can use += f(mask), as you do and as the code template was directing. But my values are still wrong. This is the formula I use based on the instructions, and my weights are different from yours when the mask is applied (with no mask, I am correct):
scaled_attention_logits += (1. - mask) * 1.0e-9

You lost me there: why do you have to convert the type of scaled_attention_logits? You get it by doing a tf.matmul of two TF tensors followed by dividing that result by a scalar tensor, right? So it should already be a TF EagerTensor.

In other words the problem here has nothing to do with mask. The values are already wrong before you get there. This should not be complicated: look again at the formulas that you are implementing here.

Here is the formula:

\displaystyle \frac {QK^T}{\sqrt{d_k}}

The operation in the numerator there is a dot product, right? As in tf.matmul. That is Prof Ng’s notational convention and it’s been that way from the “get go”. If he meant elementwise, he would have used “*” explicitly.

I would guess there is an error in one of these lines:

    matmul_qk = None  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk
    dk = None

Thank you, guys. Once again, I did not read the instructions carefully. I used np.matmul and not tf.matmul. That was the first error. Then I multiplied by 1e-9 instead of -1e9 (as is necessary for the softmax!). That was another error.
Now everything is fixed.
Thanks again!
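For anyone who hits the same issues later, here is a minimal sketch of the corrected flow we converged on in this thread (tf.matmul for QK^T, dk taken from the key depth, and -1e9 in the mask term). It is a summary under those assumptions, not the graded notebook code verbatim:

import tensorflow as tf

def scaled_dot_product_attention_sketch(q, k, v, mask=None):
    # Q K^T as a true matrix product (tf.matmul), not an elementwise product
    matmul_qk = tf.matmul(q, k, transpose_b=True)        # (..., seq_len_q, seq_len_k)

    # scale by sqrt(dk), where dk is the depth (last dimension) of the keys
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add a large negative number (-1e9, not 1e-9) at masked positions so that
    # softmax drives those attention weights to essentially zero
    if mask is not None:
        scaled_attention_logits += (1. - mask) * -1e9

    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # rows sum to 1
    output = tf.matmul(attention_weights, v)             # (..., seq_len_q, depth_v)
    return output, attention_weights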

Glad to hear you found the issues. Yet again we learn the lesson that “saving time” by not reading the instructions diligently is frequently not a net savings of time. :nerd_face:

Also note that I gave you the suggestion of checking the type of scaled_attention_logits four hours ago.

Yes, but I was not continuously working on this problem; I had a break :slight_smile:
Anyway, thanks a lot!

I am done with the Deep Learning Specialization.
I am now enrolling in the TensorFlow Developer Professional Specialization. First, improve my TF skills, and then move on to more NLP! :muscle:

That’s great! Congratulations on getting through DLS pretty quickly. There is no shortage of other things to learn. Onward! :nerd_face:

In this exercise, what is the right way to get dk?

Use tf.cast and tf.shape from TensorFlow.

I mean, the shape of which matrix, by definition?

The depth of the key matrix (its last dimension).
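In code, that usually looks something like the line below (a hedged sketch, assuming k has shape (..., seq_len_k, depth)):

dk = tf.cast(tf.shape(k)[-1], tf.float32)   # depth = size of the last axis of k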