OK, I got past that step by switching out1. Now I get errors in both tests, with and without the padding mask: the values in out[0,0] do not match those from the test template.
Edit:
I am an idiot. I found the error: I was not doing the 3 layer normalization correctly.