Week 4: Transformer Network (test time intuition)

Hi, after completing Week 4 and the assignment, I have a few questions about how the Transformer works at test time (as opposed to training, where we use the decoder mask). I'm going to write out my high-level understanding; please let me know if it's right.

Once we train the model, we have learned weight matrices: W_q, W_k, W_v.

Now at test time, when we receive an input X, we can compute Attention(W_q * X, W_k * X, W_v * X).
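To make that concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (names like `attention` and the toy dimensions are my own, not from the assignment). Note that Q, K, and V are each produced by one matrix multiply over the whole sequence, with no per-position loop:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    # Every position of X is projected at once -- fully vectorized.
    Q = X @ W_q                               # (seq_len, d_k)
    K = X @ W_k                               # (seq_len, d_k)
    V = X @ W_v                               # (seq_len, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len)
    return softmax(scores, axis=-1) @ V       # (seq_len, d_v)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

The key point is that the whole (seq_len, seq_len) score matrix is computed in one shot.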

Question 1. So at test time, we are able to compute the attention output for every input position x^&lt;i&gt; concurrently? (Technically it's already vectorized in the expression above.)

Question 2. As I understand it, the decoder still has to run sequentially at test time? While W_Q is fixed after training, the decoder's input (the output of the previous step) changes as we make predictions. Thus Q = W_Q * y_hat^&lt;i&gt;, recomputed at each step:

Jane _ _ _ _
Jane visits _ _ _
Jane visits Africa _ _
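The steps above can be sketched as a greedy autoregressive decoding loop. This is only a toy illustration: `decoder_step` is a hypothetical stand-in for a full decoder forward pass (masked self-attention over the tokens generated so far, producing logits over the vocabulary), and the vocabulary is made up:

```python
import numpy as np

vocab = ["<s>", "Jane", "visits", "Africa", "in", "September", "</s>"]
vocab_size = len(vocab)
rng = np.random.default_rng(1)

def decoder_step(tokens_so_far):
    # Hypothetical stand-in: a real decoder would run masked self-attention
    # over tokens_so_far (Q = W_Q * y_hat for each generated position)
    # and return logits over the vocabulary.
    return rng.normal(size=vocab_size)

tokens = [0]                          # start with <s>
for _ in range(5):                    # one full decoder pass per new token
    logits = decoder_step(tokens)
    next_tok = int(np.argmax(logits)) # greedy pick
    tokens.append(next_tok)
    if vocab[next_tok] == "</s>":
        break

print([vocab[t] for t in tokens])
```

The loop structure is the answer to Question 2 in miniature: each iteration depends on the tokens produced by the previous iterations, so the passes cannot all run concurrently the way the encoder's positions can.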
