From my understanding, during inference the previous predictions of the decoder are fed back into it to predict the next word. But during training, the answer key (the correct translation) is fed into the decoder at every time step instead of the previous predictions. Any word of the correct translation that lies past the current time step is “masked” to prevent leftward information flow. This is done by setting the corresponding inputs to the softmax in self-attention to -inf. My question is: why is the masking done right before the softmax of self-attention? Why can’t we just set a portion of the input sequence to -inf right away, before even passing it into the decoder?
I think you understood the flow well, but one key point I want to correct is that, during training, it is a vectorized operation, computed in a distributed fashion across multiple heads. It’s NOT a time-step-based operation like an RNN.
Here is an overview of the first part of the decoder. The “look-ahead” mask is used in the first multi-head attention layer.
All words in a sentence are processed in parallel with a vectorized operation. So, we need to create a vectorized mask to remove the influence of future words.
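To make this concrete, here is a minimal NumPy sketch (my own illustration, not the exact course code) of how a look-ahead mask can be built and applied to the attention scores right before the softmax:

```python
import numpy as np

def look_ahead_mask(size):
    # 1 where attention is allowed (current and earlier positions), 0 for future positions
    return np.tril(np.ones((size, size)))

def masked_self_attention(q, k, v, mask):
    # scaled dot-product attention; the mask is applied to the scores, just before softmax
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)              # (seq_len, seq_len) attention logits
    scores = np.where(mask == 1, scores, -1e9)   # "-inf" on all future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                           # masked positions get ~0 weight

seq_len, d_model = 5, 8
x = np.random.randn(seq_len, d_model)            # toy decoder inputs, whole sentence at once
out = masked_self_attention(x, x, x, look_ahead_mask(seq_len))
```

This is also why the masking happens at the scores and not at the decoder inputs: setting the scores of future positions to -inf makes their softmax weights (and therefore their influence) zero, whereas setting the input embeddings themselves to -inf would poison every dot product and every residual connection they pass through.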
This is an amazing diagram. I think I understand how the look-ahead masking works now. Do you happen to have diagrams for the other sublayers?
Also, the decoder can only predict one word at a time, no? After every word, we update the look-ahead mask. That aspect cannot be done in parallel, so it’s similar to how an RNN decoder works?
Do you happen to have diagrams for the other sublayers?
I have some drafts, but they are not finished yet, simply because it depends on the questions that come from learners.
Also, the decoder can only predict one word at a time, no? After every word, we update the look-ahead mask. That aspect cannot be done in parallel, so it’s similar to how an RNN decoder works?
The Transformer mainly focuses on quality, i.e., reducing the training time needed to learn a large number of samples through vectorization and parallelization, and obtaining a better BLEU (bilingual evaluation understudy) score with the attention mechanism.
As you pointed out, at inference time the decoder processes words one by one; it builds the whole output sentence step by step. At time “t”, the input to the decoder is the partial sentence generated so far, not just the single word from “t-1”. Of course, this input does not include future words, and that is exactly the situation the look-ahead mask recreates at training time. On the other hand, inference is not a streaming pipeline but an iteration of one-by-one word generation, so in some cases the Transformer may actually be slower at inference than the world’s fastest stacked-LSTM systems.
So the net effect is that, at inference time, the Transformer looks like an RNN, but it actually is not…
Hope this helps.
One more quick question… Is the look-ahead mask applied during inference, or only during training (teacher forcing)? During inference we feed the decoder its previous predictions, not the completed translation, so why would we need to mask future values in this case?
I think it is an implementation matter, not an architecture matter.
What I saw was quite a simple one: reuse the exact same algorithm even for inference.
Start from an empty output (in that case the prediction routine usually adds <SOS>, the start token), call the Transformer, and get a predicted word by extracting the last position. Then append it to the output, which is now “<SOS>, <1st>”. Then call the Transformer again, get another predicted word, and append it, giving “<SOS>, <1st>, <2nd>”, and so on. So this totally reuses the same routines. Of course, the training flag is set to False, but that only affects Dropout; all masks are used as-is.
I think this is one way of doing it. The design point is to reuse a proven algorithm for inference. Of course, you can pick a different design point, such as speed, and some implementations prepare a separate algorithm for inference that does not use masks. But, again, that is an implementation matter.
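For illustration only, here is a rough greedy-decoding loop in Python that reuses the same masked forward pass. The call `transformer(encoder_input, output, training=False)` is a hypothetical stand-in for the trained model, assumed to return one row of vocabulary logits per decoder position:

```python
import numpy as np

def greedy_decode(transformer, encoder_input, sos_id, eos_id, max_len=50):
    # Start from just the <SOS> token and grow the output one word per iteration.
    output = [sos_id]
    for _ in range(max_len):
        # Exactly the same routine as in training: same look-ahead mask, training=False.
        logits = transformer(encoder_input, output, training=False)
        next_id = int(np.argmax(logits[-1]))     # only the last position is new
        output.append(next_id)
        if next_id == eos_id:                    # stop once the end token is produced
            break
    return output
```

The look-ahead mask is harmless here because the partial output contains no future words to hide anyway; keeping it just makes inference follow exactly the same code path as training.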
Hi anon
I have just read your following two statements:
- It’s NOT a time-step-based operation like an RNN.
- All words in a sentence are processed in parallel
If I get this right, then I think my understanding of look-ahead masking has always been wrong.
A) WHAT I THOUGHT SO FAR
I thought that with decoder-based LLMs like GPT you not only do the predictions step by step (i.e., word by word) but also do the training step by step; e.g., as follows for the training sentence “Jane visited Africa in September”:
- Step 1: Give the model the token “Jane” by masking away the other four tokens of the sentence. Loss = difference between the single predicted token and “visited”.
- Step 2: Give the model the tokens “Jane” and “visited” by masking away the other three tokens of the sentence. Loss = difference between the single predicted token and “Africa”. Here, bi-directional attention is performed amongst the input tokens; i.e., the “Jane” token not only pays attention to “Jane” but also to the “visited” token (which sits to its right), because this is part of the known input.
- Step 3, 4, 5: Etc.
I.e., training is done step by step and there is bi-directional attention among all the input tokens in each step. In each step the model predicts one next word and compares this prediction with the correct next word to determine the loss of that step. The attention mask is not a diagonally cut mask but rather a vertically cut mask, with the 1s extending further and further to the right. E.g. here for step 2 (reproduced by the small snippet after the matrix):
[[1, 1, 0, 0, 0],
[1, 1, 0, 0, 0],
[1, 1, 0, 0, 0],
[1, 1, 0, 0, 0],
[1, 1, 0, 0, 0]]
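Just to be explicit about what I mean, this kind of step mask could be generated like this (my own toy snippet, not from any course material):

```python
import numpy as np

def step_mask(seq_len, step):
    # "vertically cut" mask from A): at step k, the first k tokens are visible
    # to every position, and everything to their right is masked away
    mask = np.zeros((seq_len, seq_len), dtype=int)
    mask[:, :step] = 1
    return mask

print(step_mask(5, 2))   # same pattern as the step-2 matrix above
```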
B) WHAT I THINK AFTER READING YOUR COMMENTS
If I understand you correctly, then A) is completely wrong. Decoder-based models like GPT are trained in a single step (and that’s why people say their training is hugely more efficient than that of RNNs), as follows:
- Single step: You input the whole sentence “Jane”, “visited”, “Africa”, “in”, “September” into the model at once. In the self-attention modules, each token only pays attention to the tokens to its left (i.e., not bidirectional but unidirectional attention). E.g., “Jane” only pays attention to the “Jane” token and not to the “visited” token. As a result, the model does not just predict one next token; it predicts the whole output sequence/sentence at once. If we call these predictions “output token 1” through “output token 5”, then the total loss of this single step = sum(difference(“visited”, “output token 1”), difference(“Africa”, “output token 2”), …), i.e., each output position is scored against the next token of the sentence. And this “one step calculates the whole output sequence” approach only works if we make sure that each input token only pays attention to the input tokens on its left, via the following diagonal mask (see also the small sketch after the matrix):
[[1, 0, 0, 0, 0],
[1, 1, 0, 0, 0],
[1, 1, 1, 0, 0],
[1, 1, 1, 1, 0],
[1, 1, 1, 1, 1]]
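To check my own understanding of B), here is a small NumPy toy (my own sketch, with a single attention head and random embeddings standing in for the full model) showing that with this mask the outputs at early positions do not depend on later tokens, and that the targets are just the inputs shifted by one:

```python
import numpy as np

def causal_mask(n):
    # the diagonal mask above: position i may only attend to positions 0..i
    return np.tril(np.ones((n, n)))

def masked_attention(x, mask):
    # single-head self-attention with the mask applied to the scores (pre-softmax)
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask == 1, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

np.random.seed(0)
x = np.random.randn(5, 8)               # embeddings for "Jane visited Africa in September"
out = masked_attention(x, causal_mask(5))

x2 = x.copy()
x2[2:] = np.random.randn(3, 8)          # change "Africa in September"
out2 = masked_attention(x2, causal_mask(5))
print(np.allclose(out[:2], out2[:2]))   # True: "Jane"/"visited" ignore the future tokens

tokens = ["Jane", "visited", "Africa", "in", "September"]
inputs, targets = tokens[:-1], tokens[1:]
# the prediction made at "Jane" is scored against "visited", the one at "visited"
# against "Africa", and so on, all in one vectorized pass over the whole sentence
```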
=> Am I correct that you are saying that B) is correct and A) is wrong?
If this is correct, then one might ask why we should give up context information by going with unidirectional attention (option B) instead of bidirectional attention (option A). If I understand correctly, in an ideal world we would indeed prefer to create a better model by following A). But step-by-step training takes so much more time that with A) you would have to sacrifice training volume (i.e., you would not be able to process a training set as large as you can with the more efficient option B).
=> Does that make sense?