hi @hj320
Errors in the Scaled dot product attention graded cell.
-
First, for the line that multiplies q by k transposed: you are using the wrong Python function to do that multiplication.
In the additional hints section just before the graded cell, it mentions:
"you may find tf.matmul useful for matrix multiplication (check how you can use the parameter transpose_b)" -
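In case it helps, here is a minimal sketch of that first step; the toy q and k tensors are only there to make it runnable and show the shapes (in the notebook they are the function arguments):

```python
import tensorflow as tf

# Toy tensors just to illustrate shapes.
q = tf.random.uniform((1, 3, 4))   # (..., seq_len_q, depth)
k = tf.random.uniform((1, 5, 4))   # (..., seq_len_k, depth)

# transpose_b=True transposes the last two dimensions of k inside the call,
# so this computes q @ k^T without a separate tf.transpose.
matmul_qk = tf.matmul(q, k, transpose_b=True)
print(matmul_qk.shape)  # (1, 3, 5) -> (..., seq_len_q, seq_len_k)
```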
Next, to calculate dk, kindly use tf.shape rather than k.shape. As you know, dk is the dimension of the keys, which is used to scale everything down so the softmax doesn't explode; that dimension is the last axis of k, so the index should be [-1], not -2.
In the next code line, to calculate the scaled attention logits, the denominator is supposed to be tf.math.sqrt(dk), not dk**0.5, since the formula divides by the square root of dk. -
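Continuing the same sketch, those two lines could look like this (the cast to float32 is my assumption; tf.math.sqrt needs a float input, and tf.shape returns integers):

```python
# dk is the size of the last dimension of k (the key depth).
dk = tf.cast(tf.shape(k)[-1], tf.float32)

# Scale by the square root of dk so the logits stay in a reasonable range.
scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
```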
When adding the mask to the scaled tensor, your code is close, but we have seen that leaving out the decimal point can make a difference to the scaled weights. The instructions just before the graded cell say to multiply (1. - mask) by -1e9, whereas you multiplied (1 - mask). Make sure you write it exactly the way the instructions show.
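A minimal sketch of that masking step, assuming the notebook's convention of a 0/1 mask where 1 means the position is kept (the toy mask below just hides the last key position):

```python
# Toy mask: 1. at positions to keep, 0. at positions to hide.
mask = tf.concat([tf.ones((1, 3, 4)), tf.zeros((1, 3, 1))], axis=-1)

if mask is not None:
    # Write 1. with the decimal point, exactly as the instructions show.
    # (1. - mask) is 1 at the hidden positions, so adding -1e9 there drives
    # their softmax weight to ~0.
    scaled_attention_logits += (1. - mask) * -1e9
```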
-
When normalizing with softmax, you do not need to pass an axis argument; you only need to call the right activation function, which you did, and its default already normalizes over the last axis. So remove axis=-1.
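For example, assuming you are using tf.keras.activations.softmax (tf.nn.softmax behaves the same way here), the default axis is already the last one:

```python
# Default axis is -1, i.e. across the keys, so no axis argument is needed.
attention_weights = tf.keras.activations.softmax(scaled_attention_logits)
print(tf.reduce_sum(attention_weights, axis=-1))  # each row sums to ~1
```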
Let me know how it goes after making these corrections.
Regards
DP