AttentionQKV dense layers

So the Trax implementation of AttentionQKV seems a bit different from the way it was explained in the videos, specifically the parallel Dense layers and the output Dense layer. Why do we need those Dense layers?

Why couldn’t we apply the QKV formula directly, the way it was explained? Is there somewhere I can read more about this?
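For concreteness, here is a minimal NumPy sketch (not the actual Trax code) of the structure I’m asking about: three parallel Dense (linear) projections applied to the query, key, and value inputs before the scaled dot-product formula, plus one more Dense projection on the output. The function and weight names are made up for illustration, and biases are omitted.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """The formula from the lectures: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def attention_qkv_sketch(q_in, k_in, v_in, w_q, w_k, w_v, w_out):
    """Rough shape of what the Trax layer seems to do: three parallel
    linear (Dense) projections in front of the attention formula,
    followed by one more linear projection on the result."""
    q = q_in @ w_q       # learned query projection
    k = k_in @ w_k       # learned key projection
    v = v_in @ w_v       # learned value projection
    out = scaled_dot_product_attention(q, k, v)
    return out @ w_out   # learned output projection

# Toy usage: one sequence of length 4 with d_model = 8 (self-attention).
rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(size=(4, d_model))
w_q, w_k, w_v, w_out = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(attention_qkv_sketch(x, x, x, w_q, w_k, w_v, w_out).shape)  # (4, 8)
```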


I’m curious about this as well. It seems to contradict the lecture video, where they explicitly say that scaled dot-product attention involves no neural networks, so it would be nice to have the role of the Dense layers in the Trax implementation cleared up.