Hi, I just finished the Week 4 lectures, and I have some questions about the decoder part of the transformer.
So I am wondering: when the Q vector is fed into the second multi-head attention block in the picture, are all the words in the vocabulary used at this stage, or only the tokens generated up to the current time step?
I am pretty confused about this.
For example, with Prof. Andrew's sentence "Jane visite l'Afrique en septembre," we have all of the tokens available at the encoder, so we can compute the dot products between Q and K.
But in the decoder, each token is only available once it has been generated: at the first time step we only have <SOS>. This makes me wonder whether the vocabulary words are being used somewhere here to predict "Jane".
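To make my confusion concrete, here is a rough NumPy sketch of what I think happens at the first decoder time step. The toy shapes, the little `attention` helper, the 5-token encoder output, and the made-up vocabulary size are all my own assumptions, not the course code, so please correct me if this picture is wrong:

```python
import numpy as np

d_model = 4          # toy embedding size (real models use e.g. 512)
np.random.seed(0)

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Encoder output: one vector per source token of
# "Jane visite l'Afrique en septembre" (5 tokens, random toy values)
encoder_out = np.random.randn(5, d_model)

# Decoder input at the first time step: only <SOS>
decoder_in = np.random.randn(1, d_model)        # shape (1, d_model)

# First (masked) self-attention in the decoder: Q, K, V all come
# from the tokens generated so far -- just <SOS> at step 1
self_attn_out = attention(decoder_in, decoder_in, decoder_in)

# Second multi-head attention (cross-attention): Q comes from the
# decoder, while K and V come from the full encoder output
cross_attn_out = attention(self_attn_out, encoder_out, encoder_out)

# As far as I can tell, the vocabulary only shows up at the very end:
# a linear layer + softmax over vocab_size scores picks "Jane"
vocab_size = 10000                               # made-up size
W_out = np.random.randn(d_model, vocab_size)
logits = cross_attn_out @ W_out
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.shape)                               # (1, vocab_size)
```

Is it right that the attention blocks themselves never touch the vocabulary, and only this final linear + softmax step does?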
I appreciate any help you can provide.