Hi,
I am having trouble understanding the code for the encoder layer. To compute self-attention, the instructions say: "1. You will pass the Q, V, K matrices and a boolean mask to a multi-head attention layer. Remember that to compute self-attention, Q, V and K should be the same." How do I calculate Q, V and K?
For self-attention, Q, K, and V are all the same tensor: the 'x' input. You need to pass it three times.
The 'mask' is provided as a function parameter; you need to pass it to the attention layer as well.
For dropout1, you also need to pass training=training.
For out2, you need to use out1, not attn_output.
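To illustrate the general pattern (this is not the assignment notebook's code), here is a minimal generic sketch of a Transformer encoder layer using the standard tf.keras.layers.MultiHeadAttention API. The layer names (mha, dropout1, layernorm1, ffn) and the dimensions are assumptions for illustration only:

```python
import tensorflow as tf

class EncoderLayer(tf.keras.layers.Layer):
    """Generic encoder layer sketch: self-attention + feed-forward,
    each followed by dropout, a residual connection, and layer norm."""

    def __init__(self, embedding_dim=128, num_heads=8, ff_dim=512, rate=0.1):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embedding_dim)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_dim, activation="relu"),
            tf.keras.layers.Dense(embedding_dim),
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        # Self-attention: Q, K and V are all the same tensor x.
        # Keras' MultiHeadAttention takes (query, value, key), so
        # passing x three times covers all of them; the mask is
        # forwarded as attention_mask.
        attn_output = self.mha(x, x, x, attention_mask=mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # residual connection

        # The feed-forward block consumes out1, NOT attn_output.
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)
        return out2
```

The key point for out2 is that the feed-forward block runs on out1 (the normalized residual), so attn_output should not appear anywhere past the first layer norm.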
Also, please edit your message to remove the code. Posting your code isn’t allowed by the course Honor Code.