Hello, sirs!
I wonder whether I correctly follow the logic behind the arguments in the self.mha call:
self_mha_output = self.mha(x, x, x, mask)
Do we pass the same x tensor for the query, keys, and values because in step 6.1 (Encoder Layer) we are building a self-attention mechanism, where every position of the input sequence attends to every other position of that same sequence?
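To check my understanding, here is a minimal sketch of what I think that call does, assuming self.mha is a tf.keras.layers.MultiHeadAttention layer (as in the assignment) and picking arbitrary shapes purely for illustration:

```python
import tensorflow as tf

# Minimal sketch (assumption: self.mha is a tf.keras.layers.MultiHeadAttention
# layer). In self-attention the same tensor x supplies the query, value, and
# key, so every position of the sequence attends to every other position of
# that same sequence.
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)

batch, seq_len, d_model = 1, 5, 16                 # hypothetical sizes
x = tf.random.uniform((batch, seq_len, d_model))   # encoder input
mask = tf.ones((batch, 1, seq_len))                # padding mask, 1 = attend

# Self-attention: query = value = key = x
self_mha_output = mha(x, x, x, attention_mask=mask)
print(self_mha_output.shape)                       # (1, 5, 16)

# By contrast, in the decoder's cross-attention block the query would come
# from the decoder while key/value come from the encoder output, e.g.
# mha(decoder_x, encoder_output, encoder_output, attention_mask=mask).
```

If that reading is right, the case where the query and the key/value tensors differ would only come up later, in the decoder's second (cross-attention) block.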