Hi, I keep failing the unit tests (and the section test) for scaled_dot_product_attention. For context, here is my logic:

First, I compute the matrix product of q and k transposed to produce matmul_qk.

Then I scale matmul_qk by the square root of dk, where dk = tf.cast(tf.shape(k)[-1], dtype=tf.float32).

So the scaled attention logits are matmul_qk / tf.math.sqrt(dk).

I am not sure whether any of the above steps are wrong, but I assume the following part is implemented correctly: if a mask is provided, I first compute (1. - mask), multiply it by 1e9, and add the result to the scaled attention logits.
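For reference, the usual formulation of this masking step (e.g. in the TensorFlow Transformer tutorial) multiplies (1. - mask) by a large *negative* number, so that masked positions end up with a softmax weight of roughly zero. Here is a minimal NumPy sketch of just that step; the array values are made up for illustration:

```python
import numpy as np

# Toy scaled logits for a single query over 3 keys (made-up values)
scaled_logits = np.array([2.0, 1.0, 0.5])

# mask == 1 means "attend to this key", mask == 0 means "ignore it"
mask = np.array([1.0, 1.0, 0.0])

# Adding a huge negative number to masked positions drives their
# post-softmax weight to (near) zero; note the minus sign on 1e9
masked_logits = scaled_logits + (1.0 - mask) * -1e9

weights = np.exp(masked_logits) / np.sum(np.exp(masked_logits))
```

After this, `weights` still sums to 1, but the third (masked) position carries essentially no weight.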

I then feed that into the softmax activation, multiply the result with v, and return it.
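Putting the steps above together, here is a minimal NumPy sketch of the whole computation. This mirrors the standard scaled dot-product attention formulation rather than the grader's exact expected code, and it uses the conventional -1e9 (negative) factor on the mask term:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    # (1) matmul_qk = q @ k^T
    matmul_qk = q @ k.T
    # (2) scale by sqrt(dk), where dk is the key depth k.shape[-1]
    dk = np.float32(k.shape[-1])
    scaled_attention_logits = matmul_qk / np.sqrt(dk)
    # (3) masked positions (mask == 0) get a large negative logit
    if mask is not None:
        scaled_attention_logits += (1.0 - mask) * -1e9
    # (4) softmax over the key axis, then weight the values
    attention_weights = softmax(scaled_attention_logits, axis=-1)
    output = attention_weights @ v
    return output, attention_weights
```

For each query row, `attention_weights` should sum to 1 over the keys, and any key position masked with 0 should receive (near) zero weight.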

For more info, here is my output: