C5W4 Ex3 scaled_dot_product_attention - wrong masked weights


I am getting quite confused with the implementation of Ex3. When I run the unit test for this method, the “Wrong masked weights” assertion error is thrown:

AssertionError                            Traceback (most recent call last)
<ipython-input-40-00665b20febb> in <module>
      1 # UNIT TEST
----> 2 scaled_dot_product_attention_test(scaled_dot_product_attention)

~/work/W4A1/public_tests.py in scaled_dot_product_attention_test(target)
     73     assert np.allclose(weights, [[0.30719590187072754, 0.5064803957939148, 0.0, 0.18632373213768005],
     74                                  [0.3836517333984375, 0.3836517333984375, 0.0, 0.2326965481042862],
---> 75                                  [0.3836517333984375, 0.3836517333984375, 0.0, 0.2326965481042862]]), "Wrong masked weights"
     76     assert np.allclose(attention, [[0.6928040981292725, 0.18632373213768005],
     77                                    [0.6163482666015625, 0.2326965481042862],

AssertionError: Wrong masked weights

I am not quite sure what is causing this. To avoid posting code, here is a summary of my implementation:

  • Multiplied the query and transpose of key matrices through matmul
  • Initialised dk with the depth dimension of k
  • Implemented the mask
  • Used the TensorFlow softmax method on the scaled attention logits

This leads me to believe that there may really be something wrong with the method implementation in exercise 1 get_angles.

Any help is appreciated!

Errors in get_angles() are unrelated to scaled_dot_product_attention().

Your summary is missing a few steps of computation.

Yep! Just solved the get_angles error (silly mistake) and have edited it out of the question.

Ex3 summary with the missing steps in bold:

  • Multiplied the query and transpose of key matrices through matmul
  • Initialised dk with the depth dimension of k
  • scaled_attention_logits set to the division of matmul_qk and the square root of dk
  • Implemented the mask
  • Used the TensorFlow softmax method on the last (seq_len_k) axis of the scaled_attention_logits

Managed to pass all other test cases for the entire assignment. However, due to this single assertion error, the grader returns a grade of 0/100.

Any help with this is greatly appreciated.

Are you using numpy functions, or TensorFlow functions?

Thanks for the reply
I am using Tensorflow functions:

  • tf.linalg.matmul
  • tf.cast(), tf.shape()
  • tf.math.sqrt()
  • tf.nn.softmax() – also tried tf.keras.layers.Softmax(axis)(logits) with same result

I have also tried restarting the kernel just in case.
My method is basically the same as the one presented in the TensorFlow tutorial for the Transformer model.

The attention weights output by my implementation are:

[[0.2589478  0.42693272 0.15705977 0.15705977]
 [0.2772748  0.2772748  0.2772748  0.16817567]
 [0.33620113 0.33620113 0.12368149 0.2039163 ]]

The output expected by the test case is:

[[0.30719590187072754, 0.5064803957939148, 0.0, 0.18632373213768005],
 [0.3836517333984375, 0.3836517333984375, 0.0, 0.2326965481042862],
 [0.3836517333984375, 0.3836517333984375, 0.0, 0.2326965481042862]]
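
One way to narrow this down using only the two matrices above: if the third column of the reported weights is zeroed and each row is renormalised, the rows should line up with the expected weights. That would suggest the logits and the softmax are fine and that the mask simply is not taking effect. A minimal NumPy sketch of that check (the `keep` vector is an assumption read off the 0.0 column in the expected output, not something taken from the test file):

```python
import numpy as np

# Weights reported in the post above.
reported = np.array([[0.2589478, 0.42693272, 0.15705977, 0.15705977],
                     [0.2772748, 0.2772748, 0.2772748, 0.16817567],
                     [0.33620113, 0.33620113, 0.12368149, 0.2039163]])

# Weights expected by the unit test.
expected = np.array([[0.30719590187072754, 0.5064803957939148, 0.0, 0.18632373213768005],
                     [0.3836517333984375, 0.3836517333984375, 0.0, 0.2326965481042862],
                     [0.3836517333984375, 0.3836517333984375, 0.0, 0.2326965481042862]])

# Assumption: the third key position is the hidden one (it is 0.0 in `expected`).
keep = np.array([1.0, 1.0, 0.0, 1.0])

# Drop the hidden column and rescale each row so it sums to 1 again.
renormalised = reported * keep
renormalised = renormalised / renormalised.sum(axis=-1, keepdims=True)

print(np.allclose(renormalised, expected, atol=1e-5))
```

This works because a softmax restricted to the kept positions is the same as the full softmax with the hidden entries removed and the rest rescaled.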

Try tf.keras.activations.softmax(), and the argument is the scaled attention logits.
I think (axis)(logits) is the wrong syntax.

The tf.keras.activations.softmax() also resulted in the same output.

Interestingly, I didn’t get a syntax error with tf.keras.layers.Softmax(axis)(logits), but nevertheless tried tf.keras.layers.Softmax()(logits) (since axis = -1 is default anyways).

Having no idea what to try anymore, I tried copy-pasting the code parts of the TensorFlow Tutorial, but the output from the method is still exactly the same.

When you’re using the functional API, the data parameter goes inside the parentheses, not outside as a separate argument.

Yes, that is what I have done for the method you suggested. What I have currently:
attention_weights = tf.keras.activations.softmax(scaled_attention_logits)
This still results in the assertion error due to the output still being

Please note: providing the inputs outside is shown in the documentation for tf.keras.layers.Softmax():

layer = tf.keras.layers.Softmax()

Check if you are using (1-mask).
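
To illustrate the (1-mask) hint without posting the graded TensorFlow code, here is a rough NumPy sketch. It assumes the mask convention is 1 for positions to keep and 0 for positions to hide, so the large negative value has to be added where the mask is 0. The function name `attention_sketch` and the toy `q`, `k`, `v` and `mask` values are assumptions made for this example, picked so the result can be compared with the expected weights quoted earlier in the thread.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention_sketch(q, k, v, mask=None):
    # Hypothetical NumPy stand-in, not the assignment's TensorFlow function.
    matmul_qk = q @ k.T                       # shape (seq_len_q, seq_len_k)
    dk = k.shape[-1]                          # depth of the keys
    scaled_logits = matmul_qk / np.sqrt(dk)
    if mask is not None:
        # Assumed convention: mask == 1 keeps a position, mask == 0 hides it,
        # so the penalty is applied through (1 - mask), not through mask.
        scaled_logits = scaled_logits + (1.0 - mask) * -1e9
    weights = softmax(scaled_logits, axis=-1)  # normalise over the key axis
    return weights @ v, weights

# Toy inputs (assumed for illustration).
q = np.array([[1, 0, 1, 1], [0, 1, 1, 1], [1, 0, 0, 1]], dtype=np.float32)
k = np.array([[1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 0], [0, 0, 0, 1]], dtype=np.float32)
v = np.array([[0, 0], [1, 0], [1, 0], [1, 1]], dtype=np.float32)
mask = np.array([[1, 1, 0, 1]] * 3, dtype=np.float32)

attention, weights = attention_sketch(q, k, v, mask)
print(np.round(weights, 4))
```

With these inputs the third column of `weights` comes out as 0, as in the expected output. Calling the same function with `mask=None` should give the unmasked weights posted earlier, which is what you would see if the mask term ends up having no effect.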