Parallelism At Decoder Layer In Transformers

I understand how parallelism works in the encoder layer. But there is one thing I can't wrap my head around: why don't we do all computations in parallel in the decoder layer as well? I remember Andrew said, "Since we know what the true y output is, we can present each node in the decoder with the correct previous words and see how well it can guess what the next word is." In that case, we could use the look_ahead_mask in each decoder node and present the true output multiple times to calculate the loss function and do backpropagation, instead of waiting for each node to make a prediction.

Thank you for your time.

Just like an encoder, a decoder is also made of multiple layers. Please explain what you mean by this:

In the encoder, we don't wait for the third word in the sentence in order to create an encoding for the fourth word. However, in the decoder layer, we need the predictions from the previous nodes to be able to make a prediction at a specific node. If we already have the correct translation at training time, why not use the look-ahead mask and the correct words from our data to calculate the loss function & gradients faster?

During training, decoder attention happens in parallel since (as you pointed out) we have the correct translation.
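To make that concrete, here is a rough sketch of the training vs. inference difference. The `toy_decoder` stub, token ids, and shapes are made up for illustration and are not the assignment's API; the point is only that training feeds the whole ground-truth translation at once (shifted right), while inference has to loop one token at a time.

```python
import numpy as np

# Stand-in for a decoder that scores the whole vocabulary for every target
# position in a single call. Names, shapes, and the random logits are purely
# illustrative; this is not the assignment's model.
def toy_decoder(enc_output, target_tokens):
    vocab_size = 10
    rng = np.random.default_rng(0)
    batch, length = target_tokens.shape
    return rng.normal(size=(batch, length, vocab_size))  # logits for ALL positions at once

enc_output = np.zeros((1, 5, 8))        # pretend encoder output for the English sentence
target = np.array([[4, 7, 2, 9]])       # ground-truth Spanish token ids

# Training: one parallel pass (teacher forcing).
start_token = np.array([[0]])
decoder_input = np.concatenate([start_token, target[:, :-1]], axis=1)  # shift right
logits = toy_decoder(enc_output, decoder_input)   # predictions for every position in one shot
# The loss compares these logits against `target` at every position;
# no waiting for earlier predictions.

# Inference: sequential, because the true translation is unknown.
generated = [0]                                    # start token
for _ in range(target.shape[1]):
    logits = toy_decoder(enc_output, np.array([generated]))
    generated.append(int(logits[0, -1].argmax()))  # only the newest position's prediction is used
```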

Consider the task of translating English to Spanish. Say we have already predicted the first 3 words. To predict the 4th word, we need the complete information about the English sentence and information about the first 3 Spanish words. This is what the look-ahead mask is meant to help with.
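For example, with a 4-word target the look-ahead mask is just a lower-triangular matrix: row i lets position i attend to itself and everything before it, and hides everything after it. The plain NumPy below is only a sketch; the assignment builds its mask in TensorFlow and its 0/1 convention for "masked" may be the opposite.

```python
import numpy as np

def look_ahead_mask(size):
    # 1 = allowed to attend, 0 = future position that must be hidden.
    # (The assignment may use the opposite 0/1 convention; the idea is the same.)
    return np.tril(np.ones((size, size)))

print(look_ahead_mask(4))
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]  <- last row: while predicting the 4th word we may look at the
#                     first 3 Spanish words (plus the start token), never beyond.

# Inside self-attention, blocked entries get a huge negative score so their
# softmax weight is effectively zero.
scores = np.random.default_rng(1).normal(size=(4, 4))        # raw query-key scores
masked = np.where(look_ahead_mask(4) == 1, scores, -1e9)
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
```

Because every row of the mask is available up front, all 4 positions' attention weights come out of one matrix operation, which is exactly why training can run in parallel.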

Please try the programming assignment and reply to this thread if something is unclear.

Oh okay, so we form the words in the translation simultaneously during training, right?

Your understanding is correct.


Thank you for your time. Have a good day.