I completed the assignment and the model is working pretty well, but the first exercise, implementing the scaled dot-product attention function, is not quite giving me the expected values; my values are shown below. I implemented the code in a very similar way to the corresponding lab and am happy to share it privately. I'm not sure why my values don't agree. I couldn't tag this as requested, by the way, and I have also checked out similar posts …
Output:
[[[1. 0.67]
[0.67 0.67]
[0.75 0.5 ]]]
Attention weights:
[[[0. 0.33 0. 0.33 0.33]
[0.33 0. 0. 0.33 0.33]
[0.25 0.25 0. 0.25 0.25]]]
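
For reference, here is the computation as I understand it, softmax(QK^T / sqrt(d_k) + mask term) times V, written as a minimal NumPy sketch rather than my actual assignment code. The function name, the argument names, and especially the mask convention (1 = attend, 0 = ignore) are assumptions on my part; if the assignment uses the opposite convention, the zeros in the weights would land in different positions.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """Sketch of scaled dot-product attention (hypothetical helper, not the graded code).

    q: (..., seq_len_q, depth)
    k: (..., seq_len_k, depth)
    v: (..., seq_len_k, depth_v)
    mask: broadcastable to (..., seq_len_q, seq_len_k), or None
    """
    d_k = k.shape[-1]
    # Raw similarity scores, scaled by sqrt of the key depth.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # ASSUMPTION: mask == 1 keeps a position, mask == 0 hides it.
        scores = scores + (1.0 - mask) * -1e9
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the values.
    return weights @ v, weights
```

Each row of the returned weights should sum to 1, and masked positions should come out as 0, so comparing where the zeros fall against the expected output is one way to check whether the mask is being applied with the intended convention.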