Hi, after completing week 4 + the assignment, I have a few questions about how this transformer works at test time (as opposed to training with the decoder look-ahead mask). I'm going to write out my high-level assumptions; please let me know if they're right.
Encoder:
Once we've trained the model, we have the learned weight matrices W_Q, W_K, W_V.
Now during test time, when we receive an input X, we can compute Attention(W_Q · X, W_K · X, W_V · X).
Question 1. So during test time, we are able to compute the attention for each x^<i> concurrently?
**(Technically it's already vectorized in the matrix form above.)**
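To make Question 1 concrete, here's a minimal NumPy sketch of that vectorized computation (the function name `self_attention` and the toy shapes are my own, not from the assignment). Every position's query, key, and value come out of one set of matrix multiplies, and the softmax is applied row-wise, so nothing here runs sequentially:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over a whole sequence at once.
    X: (seq_len, d_model); W_Q, W_K, W_V: (d_model, d_k), learned during training."""
    Q = X @ W_Q                      # queries for every position in one multiply
    K = X @ W_K                      # keys for every position
    V = X @ W_V                      # values for every position
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V               # (seq_len, d_k) attention output

# toy run: 4 input tokens, d_model = d_k = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)  # -> (4, 8)
```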
Question 2. As I understand it, the decoder still has to run sequentially at test time? While W_Q is fixed after training, the target (the decoder's output from the previous steps) still changes as we make predictions, so at step i the query is Q = W_Q · ŷ^<i>. The decoder input grows one token at a time (see the sketch after this example):
Jane _ _ _ _
Jane visits _ _ _
Jane visits Africa _ _
…
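If my assumption is right, test-time decoding looks roughly like this greedy loop. This is a sketch under that assumption: `greedy_decode`, `decoder`, `vocab`, and the stub below are hypothetical placeholders I made up, not the assignment's actual API.

```python
import numpy as np

def greedy_decode(encoder_output, decoder, vocab, max_len=10, end_token="<EOS>"):
    """Test-time autoregressive decoding: sequential by necessity, because
    step i's query Q = W_Q · ŷ^<i> depends on the tokens predicted at
    steps 1..i-1.  `decoder(prefix, encoder_output)` is a hypothetical
    callable returning next-token probabilities over `vocab`; it stands
    in for the masked-decoder forward pass from the assignment."""
    y_hat = ["<SOS>"]                               # start-of-sequence token
    for _ in range(max_len):
        probs = decoder(y_hat, encoder_output)      # uses the growing prefix
        next_token = vocab[int(np.argmax(probs))]   # greedy argmax pick
        if next_token == end_token:
            break
        y_hat.append(next_token)  # Jane -> Jane visits -> Jane visits Africa ...
    return y_hat[1:]

# toy run with a stub decoder that always predicts the next word of a fixed sentence
vocab = ["Jane", "visits", "Africa", "in", "September", "<EOS>"]
target = ["Jane", "visits", "Africa", "in", "September", "<EOS>"]

def stub_decoder(prefix, enc_out):
    probs = np.zeros(len(vocab))
    probs[vocab.index(target[len(prefix) - 1])] = 1.0  # deterministic next word
    return probs

print(greedy_decode(None, stub_decoder, vocab))
# -> ['Jane', 'visits', 'Africa', 'in', 'September']
```

(And my understanding is that this is exactly why the look-ahead mask only matters during training, when teacher forcing feeds the whole target sentence in at once.)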