Weights differ from the expected weights (tried with all diff axis values)
The most likely errors are:
- your value for scaled_attention_logits is incorrect, or
- your call to tf.keras.activations.softmax() is incorrect.
For the logits:
In the standard attention formula, dk is the dimension of the keys, i.e., the size of k's last axis. You can read it from np.shape(k) (or tf.shape(k)[-1]).
You need the correct dk to compute the scaled_attention_logits: divide the matmul of q and k (transposed) by the square root of dk. Note that dk is a size, not an axis index. The axis= parameter of tf.keras.activations.softmax() should select the last axis (axis=-1, which is the default), so that each query's weights sum to 1 across the keys.
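Putting those pieces together, here is a minimal sketch of the computation. The function signature, the mask convention, and the variable names are assumptions for illustration, not the assignment's exact template:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Sketch of scaled dot-product attention (names are illustrative)."""
    # Raw attention scores: shape (..., seq_len_q, seq_len_k)
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    # dk is the key dimension, read from the last axis of k's shape
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # Optional mask: large negative logits become ~0 weights after softmax
    # (this particular mask convention is an assumption)
    if mask is not None:
        scaled_attention_logits += (1.0 - mask) * -1e9

    # Softmax over the last axis (the keys), so each query's weights sum to 1
    attention_weights = tf.keras.activations.softmax(
        scaled_attention_logits, axis=-1
    )

    # Weighted sum of the values: shape (..., seq_len_q, depth_v)
    output = tf.matmul(attention_weights, v)
    return output, attention_weights
```

If your weights still differ, print scaled_attention_logits and compare it to the expected values: a wrong dk scales every entry by the same factor, while a wrong softmax axis makes the weights sum to 1 along the wrong dimension.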