C5 W4 A1 E3: AssertionError: Wrong unmasked weights

I am getting this error in the scaled_dot_product_attention function.

attention weights - (3, 4)
V shape - (4, 2)
<class 'tensorflow.python.framework.ops.EagerTensor'>

AssertionError Traceback (most recent call last)
in
1 # UNIT TEST
----> 2 scaled_dot_product_attention_test(scaled_dot_product_attention)

~/work/W4A1/public_tests.py in scaled_dot_product_attention_test(target)
60 assert np.allclose(weights, [[0.2589478, 0.42693272, 0.15705977, 0.15705977],
61 [0.2772748, 0.2772748, 0.2772748, 0.16817567],
---> 62 [0.33620113, 0.33620113, 0.12368149, 0.2039163 ]]), "Wrong unmasked weights"
63
64 assert tf.is_tensor(attention), "Output must be a tensor"

AssertionError: Wrong unmasked weights

I could not figure out the cause of this in the code. Any help is appreciated.

If none of these search results help, please click my name and message your notebook as an attachment.

I have sent you the notebook.

Please fix the following:

  1. When calculating matmul_qk, k is not multiplied directly; it's transformed before the multiplication. See the equation in the markdown for details.
  2. It's safer to use negative indexing to find dk, since the earlier dimensions are not fixed.
  3. For the if mask is not None case, read this hint: Multiply (1. - mask) by -1e9 before applying the softmax (a short sketch of this step follows the list).
  4. The calculation of the output is incorrect. Go back to the equation and notice that there are only 2 terms, not 3.
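
To make point 3 concrete, here is a minimal sketch of just that masking step, with dummy shapes and the variable names scaled_attention_logits and mask assumed from the notebook (this is an illustration, not the graded code):

```python
import tensorflow as tf

# Dummy logits and mask purely to illustrate the hint (shapes are assumptions)
scaled_attention_logits = tf.zeros((3, 4))
mask = tf.constant([[1., 1., 1., 0.]])  # 1 = attend to this position, 0 = ignore it

# (1. - mask) is 1 only at the ignored positions, so roughly -1e9 is added
# to their logits and softmax drives their weight to nearly zero
if mask is not None:
    scaled_attention_logits += (1. - mask) * -1e9

print(tf.nn.softmax(scaled_attention_logits, axis=-1))
# last column is ~0, the other entries are ~0.333 each
```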

Balaji,

Thanks for the response.

Regarding this:

  1. For if mask is not None case, read this hint: Multiply (1. - mask) by -1e9 before applying the softmax.
    Where is this -1e9 factor coming from? What do we need to do?
    Does it read "(1 dot -mask) * -1e9"?

Thanks
Bimal

See this:

I do not understand how k should be transformed before multiplication. Where is this explained?

What I meant was the transpose operation being applied to the key before multiplication.
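
For example, tf.matmul can apply that transpose for you via its transpose_b argument (a minimal sketch with placeholder shapes, not the notebook code):

```python
import tensorflow as tf

q = tf.random.uniform((3, 4))  # (seq_len_q, depth) -- placeholder values
k = tf.random.uniform((4, 4))  # (seq_len_k, depth)

# Multiplies q by the transpose of k over its last two axes,
# giving shape (seq_len_q, seq_len_k) = (3, 4)
matmul_qk = tf.matmul(q, k, transpose_b=True)
print(matmul_qk.shape)
```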

Here are the steps:

  1. matmul_qk is created by multiplying the query and the key, keeping shapes in mind.
  2. dk should be initialized to the correct value based on the shape of the key.
  3. scaled_attention_logits is calculated by dividing the two quantities from steps 1 and 2.
  4. If mask is not None, add the mask term ((1. - mask) * -1e9) to the logits computed so far.
  5. The attention weights are computed by applying softmax to those logits (see the sketch below).
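
Putting those steps together, here is a minimal sketch in TensorFlow; it is an illustration under the assumptions above (including the (1. - mask) * -1e9 convention), not the graded notebook code:

```python
import tensorflow as tf

def scaled_dot_product_attention_sketch(q, k, v, mask=None):
    # Step 1: query times the transposed key -> (..., seq_len_q, seq_len_k)
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    # Step 2: dk from the last axis of the key; negative indexing keeps this
    # correct no matter how many leading (batch) dimensions there are
    dk = tf.cast(tf.shape(k)[-1], tf.float32)

    # Step 3: scale the logits by sqrt(dk)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # Step 4: add the mask term so ignored positions end up with ~zero weight
    if mask is not None:
        scaled_attention_logits += (1. - mask) * -1e9

    # Step 5: softmax over the last (key) axis gives the attention weights
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

    # Output: only two terms -- the attention weights and the values
    output = tf.matmul(attention_weights, v)
    return output, attention_weights
```

With the shapes from the original post, attention weights of (3, 4) multiplied by a V of (4, 2) give an output of shape (3, 2).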