C5W4A1E3 Transformer Architecture (scaled_dot_product_attention) issue

In Exercise 3 (scaled_dot_product_attention) I am getting an assertion error, yet the message shows no difference between my calculated output and the correct output tensor. How can the assertion fail if my output and the correct output are the same, or nearly the same?

Hello, @Steven22,

The assert checks the attention weights, but it looks like you printed the output. Could you have swapped them somewhere? See the small sketch below for why the output can match even when the weights do not.

Cheers,
Raymond
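
Here is a toy illustration of why that distinction matters (made-up numbers, not the actual test code): two different sets of attention weights can produce the same output once they are multiplied by V, so printing only the output can hide exactly the mismatch that the assert on the weights catches.

```python
import numpy as np

# Toy numbers only (chosen for illustration, not the course's test values)
expected_weights = np.array([[0.26, 0.42, 0.16, 0.16]])
my_weights       = np.array([[0.25, 0.43, 0.16, 0.16]])   # slightly different
v                = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])

# Both weight matrices give the same output after multiplying by V...
print(expected_weights @ v)                       # [[0.68 0.32]]
print(my_weights @ v)                             # [[0.68 0.32]]
# ...but an element-wise check on the weights themselves still fails.
print(np.allclose(expected_weights, my_weights))  # False
```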

There are lots of details to get right, of course. My usual debugging method is to add print statements that show the intermediate values. Here's what I see when I run the test cell for that function, using code that passes both the unit tests and the grader:

q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk.shape (3, 4)
matmul_qk =
[[2. 3. 1. 1.]
 [2. 2. 2. 1.]
 [2. 2. 0. 1.]]
dk 4.0
type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)
attention_weights.shape (3, 4)
attention_weights =
[[0.2589478  0.42693272 0.15705977 0.15705977]
 [0.2772748  0.2772748  0.2772748  0.16817567]
 [0.33620113 0.33620113 0.12368149 0.2039163 ]]
sum(attention_weights, axis = -1) =
[[1.0000001]
 [1.       ]
 [1.       ]]
output.shape (3, 2)
output =
[[0.74105227 0.15705977]
 [0.7227253  0.16817567]
 [0.6637989  0.2039163 ]]
q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk.shape (3, 4)
matmul_qk =
[[2. 3. 1. 1.]
 [2. 2. 2. 1.]
 [2. 2. 0. 1.]]
dk 4.0
type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)
mask.shape (1, 3, 4)
applying mask =
[[[1 1 0 1]
  [1 1 0 1]
  [1 1 0 1]]]
attention_weights.shape (1, 3, 4)
attention_weights =
[[[0.3071959  0.5064804  0.         0.18632373]
  [0.38365173 0.38365173 0.         0.23269653]
  [0.38365173 0.38365173 0.         0.23269653]]]
sum(attention_weights, axis = -1) =
[[[1.]
  [1.]
  [1.]]]
output.shape (1, 3, 2)
output =
[[[0.6928041  0.18632373]
  [0.61634827 0.23269653]
  [0.61634827 0.23269653]]]
All tests passed

One approach would be to add similar print statements to your code and see if the comparison sheds any light on where our code differs.
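
For reference, here is a minimal sketch of the standard scaled dot-product attention formula, softmax(Q K^T / sqrt(dk)) V, instrumented with the same kind of prints. It is only a sketch under the usual conventions, not the notebook's exact code; the function name, the mask convention (1 = keep, 0 = mask out), and the print labels are my own.

```python
import tensorflow as tf

def scaled_dot_product_attention_debug(q, k, v, mask=None):
    # Raw score for every (query, key) pair: Q @ K^T
    matmul_qk = tf.matmul(q, k, transpose_b=True)          # (..., seq_len_q, seq_len_k)
    print("matmul_qk.shape", matmul_qk.shape)

    # Scale by sqrt(dk), where dk is the depth of the keys
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
    print("dk", dk.numpy())

    if mask is not None:
        # Masked positions (mask == 0) get a large negative logit so softmax sends them to ~0
        scaled_attention_logits += (1.0 - tf.cast(mask, tf.float32)) * -1e9

    # Softmax over the key axis; every row of weights sums to 1
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    print("attention_weights.shape", attention_weights.shape)
    print("row sums", tf.reduce_sum(attention_weights, axis=-1).numpy())

    # Output is the weighted sum of the values
    output = tf.matmul(attention_weights, v)               # (..., seq_len_q, depth_v)
    print("output.shape", output.shape)
    return output, attention_weights
```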


Paul, I finally got Exercise 3 to work. Your outputs gave me something to shoot for. I was reading too much into the explanations of the algorithm and had extra, unnecessary lines of code: I was using tf.keras.utils.normalize() to normalize the scaled logits matrix before applying the softmax activation.
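
To see concretely why that extra step breaks things, here is a small side-by-side comparison using the logits from the first (unmasked) test case above. This is only an illustration, not the notebook code: the extra L2 normalization rescales the already-scaled logits, so the softmax produces a different, flatter distribution.

```python
import tensorflow as tf

# Logits from the first test case above, already scaled by sqrt(dk) = 2
logits = tf.constant([[2., 3., 1., 1.]]) / tf.math.sqrt(4.0)

correct = tf.nn.softmax(logits, axis=-1)
# The extra L2 normalization rescales the logits, which changes the softmax result
wrong = tf.nn.softmax(tf.keras.utils.normalize(logits.numpy()), axis=-1)

print(correct.numpy())  # ~[[0.259 0.427 0.157 0.157]] -- matches the expected weights
print(wrong.numpy())    # ~[[0.260 0.337 0.201 0.201]] -- a flatter, different distribution
```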


That’s good news. Glad to hear that having the intermediate results was helpful. Thanks for confirming!