Hi, for scaled_dot_product_attention I keep failing the unit tests as well as the section test. For context, here is my logic:
first I compute the matrix product of q and k transposed to produce matmul_qk,
then I scale matmul_qk by the square root of dk, where dk is tf.shape(k)[-1] cast to dtype=tf.float32, so the scaled attention logits become matmul_qk / tf.math.sqrt(dk).
I am not sure whether any of the above steps are wrong, but I assume I have implemented the following part correctly: if a mask is provided, it is first subtracted from 1, as in (1. - mask), then multiplied by 1e9, and the result is added to the scaled attention logits.
I then feed that to a softmax activation, multiply the resulting weights with v, and return the product.
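Written out as code, this is roughly what my function looks like. I'm assuming the usual scaled_dot_product_attention(q, k, v, mask) signature here and that both the output and the attention weights get returned; the variable names other than matmul_qk and dk are just what I happen to call things locally:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask):
    # step 1: matrix product of q and k transposed
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    # step 2: scale by the square root of dk (depth of k, cast to float32)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # step 3: if a mask exists, add (1. - mask) * 1e9 to the logits,
    # exactly as described above
    if mask is not None:
        scaled_attention_logits += (1. - mask) * 1e9

    # step 4: softmax over the last axis, then weight v
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    output = tf.matmul(attention_weights, v)

    return output, attention_weights
```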
For more info, here is my output: