I may be misunderstanding transformers at a conceptual level, but I don’t think the lecture covers my question.
I noticed that when we implemented the encoder, we didn’t keep track of or update the attention weights, but we did in the decoder. It doesn’t look like those weights were used later on either; but even if they were, why wouldn’t the encoder’s weights be tracked as well?
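
To make the difference I mean concrete, here is a minimal sketch of the pattern, assuming a Keras-style `MultiHeadAttention`; the layer names and shapes are my own illustration, not the assignment’s code. The encoder-style layer computes attention weights internally and throws them away, while the decoder-style layer returns them to the caller:

```python
import tensorflow as tf

class EncoderStyleLayer(tf.keras.layers.Layer):
    """Self-attention block that discards its attention weights."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, x):
        # Weights are computed inside the layer but never returned.
        attn_out = self.mha(query=x, value=x, key=x)
        return self.norm(x + attn_out)

class DecoderStyleLayer(tf.keras.layers.Layer):
    """Self-attention block that also hands the attention weights back."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, x):
        # return_attention_scores=True exposes the weights to the caller,
        # e.g. so they can be collected in a dict for later inspection.
        attn_out, attn_weights = self.mha(
            query=x, value=x, key=x, return_attention_scores=True)
        return self.norm(x + attn_out), attn_weights

# Only the decoder-style layer gives the caller access to the weights.
x = tf.random.uniform((2, 5, 16))          # (batch, seq_len, d_model)
enc_out = EncoderStyleLayer(16, 2)(x)
dec_out, weights = DecoderStyleLayer(16, 2)(x)
print(enc_out.shape, dec_out.shape, weights.shape)  # weights: (2, 2, 5, 5)
```

My question is why only the second pattern is used for the decoder when, as far as I can tell, the returned weights aren’t consumed anywhere downstream.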