Parallelism At Decoder Layer In Transformers

I understand how parallelism works in the encoder layer. But there is one thing I can't wrap my head around: why don't we do all computations in parallel in the decoder layer as well? I remember Andrew said, "Since we know what the true y output is, we can present each node in the decoder with the correct previous words and see how well it can guess what the next word is." In that case, we could use the look_ahead_mask in each decoder node and present the true output multiple times to calculate the loss function and do backpropagation, instead of waiting for each node to make a prediction.

Thank you for your time.

Just like an encoder, a decoder is also made of multiple layers. Please explain what you mean by this:

In the encoder, we don't wait for the third word in the sentence in order to create an encoding for the fourth word. However, in the decoder layer, we need the predictions from the previous nodes to be able to make a prediction at a specific node. If we already have the correct translation at training time, why not use the look-ahead mask and the correct words from our data to calculate the loss function & gradients faster?

During training, decoder attention happens in parallel since (as you pointed out) we have the correct translation.
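To make that concrete, here is a rough sketch of the training vs. inference difference. The `toy_decoder` stub, token ids, and shapes are made up for illustration and are not the assignment's API; the point is only that training feeds the whole ground-truth translation at once (shifted right), while inference has to loop one token at a time.

```python
import numpy as np

# Stand-in for a decoder that scores the whole vocabulary for every target
# position in a single call. Names, shapes, and the random logits are purely
# illustrative; this is not the assignment's model.
def toy_decoder(enc_output, target_tokens):
    vocab_size = 10
    rng = np.random.default_rng(0)
    batch, length = target_tokens.shape
    return rng.normal(size=(batch, length, vocab_size))  # logits for ALL positions at once

enc_output = np.zeros((1, 5, 8))        # pretend encoder output for the English sentence
target = np.array([[4, 7, 2, 9]])       # ground-truth Spanish token ids

# Training: one parallel pass (teacher forcing).
start_token = np.array([[0]])
decoder_input = np.concatenate([start_token, target[:, :-1]], axis=1)  # shift right
logits = toy_decoder(enc_output, decoder_input)   # predictions for every position in one shot
# The loss compares these logits against `target` at every position;
# no waiting for earlier predictions.

# Inference: sequential, because the true translation is unknown.
generated = [0]                                    # start token
for _ in range(target.shape[1]):
    logits = toy_decoder(enc_output, np.array([generated]))
    generated.append(int(logits[0, -1].argmax()))  # only the newest position's prediction is used
```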

Consider the task of translating English to Spanish. Say we have already predicted the first 3 words. To predict the 4th word, we need the complete information about the English sentence and information about the first 3 Spanish words. This is what the look-ahead mask is meant to help with.
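For example, with a 4-word target the look-ahead mask is just a lower-triangular matrix: row i lets position i attend to itself and everything before it, and hides everything after it. The plain NumPy below is only a sketch; the assignment builds its mask in TensorFlow and its 0/1 convention for "masked" may be the opposite.

```python
import numpy as np

def look_ahead_mask(size):
    # 1 = allowed to attend, 0 = future position that must be hidden.
    # (The assignment may use the opposite 0/1 convention; the idea is the same.)
    return np.tril(np.ones((size, size)))

print(look_ahead_mask(4))
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]  <- last row: while predicting the 4th word we may look at the
#                     first 3 Spanish words (plus the start token), never beyond.

# Inside self-attention, blocked entries get a huge negative score so their
# softmax weight is effectively zero.
scores = np.random.default_rng(1).normal(size=(4, 4))        # raw query-key scores
masked = np.where(look_ahead_mask(4) == 1, scores, -1e9)
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
```

Because every row of the mask is available up front, all 4 positions' attention weights come out of one matrix operation, which is exactly why training can run in parallel.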

Please try the programming assignment and reply to this thread if something is unclear.

Oh okay, so we form the words in the translation simultaneously during training, right?

Your understanding is correct.


Thank you for your time. Have a good day.