So the trax implementation of AttentionQKV seems a bit different from the way it was explained in the videos, specifically the parallel Dense layers on the Q, K, V inputs and the output Dense layer. Why do we need those Dense layers?
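For reference, this is roughly the structure I'm asking about, paraphrased from `trax.layers.attention` (exact layer names may differ between trax versions):

```python
from trax import layers as tl

def AttentionQKV(d_feature, n_heads=1, dropout=0.0, mode='train'):
    # Q, K, V each pass through their own Dense layer before attention,
    # and the attention result goes through a final Dense layer.
    return tl.Serial(
        tl.Parallel(
            tl.Dense(d_feature),  # learned projection for queries
            tl.Dense(d_feature),  # learned projection for keys
            tl.Dense(d_feature),  # learned projection for values
        ),
        tl.PureAttention(n_heads=n_heads, dropout=dropout, mode=mode),
        tl.Dense(d_feature),      # output projection
    )
```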
Why couldn't we apply the QKV attention formula directly, the way it was explained? Is there somewhere I can read more about this?
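To be concrete, by "applying the formula directly" I mean plain scaled dot-product attention on the raw Q, K, V inputs, with no learned weights at all. A toy sketch of my own (plain numpy, not from trax):

```python
import numpy as np

def plain_attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V, with no Dense projections anywhere
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

So my question is why trax wraps this computation in the four Dense layers instead of using it as-is.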