The tokens that the decoder block uses

Hi, I just finished the W4 lectures, and I have some questions about the decoder part of the Transformer.

So I wonder: when the Q vector is fed into the second multi-head attention block in the picture, are all the words in the vocabulary being used at this stage, or does it only use the tokens generated up to the current time step?

I am pretty confused about this.
For example, in Prof. Andrew’s example, “Jane visite l’Afrique en septembre,” we have all the tokens available at the encoder and can compute the dot products between Q and K.
But in the decoder, each token is only available once it has been generated; at the first time step we only have SOS, and this makes me wonder whether the vocabulary words are being used somewhere here to predict “Jane”.

I appreciate any help you can provide.

Hi @DavidBetterFellow

In the transformer model, during the decoding phase, only the tokens that have been generated up to the current time step are available. At each time step of the decoder, the model predicts the next token based on the previously generated tokens.
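
To make that concrete, here is a minimal sketch of greedy autoregressive decoding (this is not the course implementation; `decode_step`, `sos_id`, and `eos_id` are hypothetical names used only for illustration). At the first step only the SOS token is in the decoder input, and the vocabulary only comes into play when the final layer produces logits for the next token:

```python
import numpy as np

def greedy_decode(decode_step, encoder_output, sos_id, eos_id, max_len=50):
    """decode_step(tokens_so_far, encoder_output) -> logits over the vocabulary
    for the next token. Hypothetical signature, for illustration only."""
    tokens = [sos_id]                     # time step 1: only SOS is available
    for _ in range(max_len):
        logits = decode_step(tokens, encoder_output)
        next_id = int(np.argmax(logits))  # pick the most likely vocabulary word
        tokens.append(next_id)            # it becomes decoder input for the next step
        if next_id == eos_id:
            break
    return tokens
```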

The decoder is intentionally prevented from attending to future tokens: a look-ahead (causal) mask in its first self-attention block hides positions that come after the current one, so the prediction at each time step can only depend on the tokens generated so far.
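
As a toy illustration (plain NumPy, not the TensorFlow code from the assignment; `look_ahead_mask`, `scaled_dot_product_attention`, and the random inputs are just for demonstration), the mask adds -inf to the attention scores of future positions before the softmax, so their weights become zero:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(QK^T / sqrt(d_k) + mask) V for a single sequence."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)       # (seq_len, seq_len) attention scores
    if mask is not None:
        scores = scores + mask            # -inf where attention is not allowed
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def look_ahead_mask(seq_len):
    """Position i may only attend to positions <= i; future positions get -inf."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Toy example: 3 decoder positions (e.g. SOS, "Jane", "visits"), d_model = 4
np.random.seed(0)
x = np.random.randn(3, 4)
out = scaled_dot_product_attention(x, x, x, mask=look_ahead_mask(3))
print(np.round(out, 3))
```

With this mask, row i of the attention weights is non-zero only for columns 0..i, which is exactly the "only the tokens generated so far" behaviour described above.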

Thanks for your explanation!

You’re welcome :raised_hands:
