All previously generated tokens as decoder input or only the latest generated token as decoder input

Kathy83 · July 10, 2024, 8:06am

Hi there, I have finished the assignments for Course 4 Weeks 1, 2, and 3, but I have a question about the decoder input. In the Week 1 assignment (NMT), the function ‘translate()’ produces the translation in a for loop as follows. It feeds only the single latest generated token into each step.

    # Iterate for max_length iterations
    for i in range(max_length):
        # Generate the next token
        next_token, logit, state, done = generate_next_token(model.decoder, context, next_token, ...)

In the Week 2 assignment (Summarization), the function ‘summarize()’ builds the summary in a similar for loop, but it takes all previously generated tokens as decoder input:

    output = tf.expand_dims([tokenizer.word_index["[SOS]"]], 0)
    for i in range(decoder_maxlen):
        predicted_id = next_word(model, encoder_input, output)
        output = tf.concat([output, predicted_id], axis=-1)

The Week 3 assignment (Question Answering) does exactly the same as the Week 2 assignment. Given these differences, why doesn’t the first assignment concatenate all previously generated tokens as decoder input to generate the next token? In translation, shouldn’t the generation of the next word depend on all of the previously generated words? Can someone help me understand this? Thank you!

Anna_Kay · July 13, 2024, 2:49pm

Hello @Kathy83!

In short, although all three assignments use encoder-decoder approaches, the Week 1 assignment uses an LSTM (RNN) based encoder-decoder model, while the Week 2 and Week 3 assignments use a Transformer-based encoder-decoder model.

This difference in architecture is the reason behind what you are asking. The two code snippets you provided differ not because they tackle different tasks (translation and summarization, respectively), but because the first is an LSTM approach while the second is a Transformer approach.

Also, your intuition regarding the question "In translation, shouldn’t the generation of the next word depend on all of the previously generated words?" is correct. Translation, like summarization, does benefit from the information provided by all the previously generated tokens, and the Transformer architecture uses it efficiently by processing multiple tokens simultaneously with self-attention.

In LSTMs, the information from all previously generated tokens is not discarded. It is incorporated into the hidden state (the state variable) and processed sequentially, which is why generate_next_token returns an updated state at every step.

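The Transformer approach has proved to be more efficient than the LSTM approach, largely because it can process all positions of a sequence in parallel during training instead of one step at a time.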
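To make the point about the hidden state concrete, here is a minimal sketch in plain Python. It is not the assignment code: rnn_step, pick_next_token, and both decode functions are hypothetical toy stand-ins, and the "model" is a single scalar recurrence rather than a real LSTM. It only illustrates that, for a recurrent model, feeding the latest token plus the carried state gives the same result as re-reading the whole generated prefix from scratch.

```python
import math


def rnn_step(state, token_id):
    """Toy recurrent update (assumption, not a real LSTM cell):
    mixes the previous state with the newly fed token."""
    return math.tanh(0.5 * state + 0.1 * token_id)


def pick_next_token(state, vocab_size=10):
    """Hypothetical deterministic stand-in for 'argmax over logits'."""
    return int(abs(state) * 1000) % vocab_size


def decode_incremental(start_token, steps):
    """Week 1 style loop: pass only the latest token and the carried state."""
    state, token = 0.0, start_token
    output = [start_token]
    for _ in range(steps):
        state = rnn_step(state, token)   # state now summarizes every token so far
        token = pick_next_token(state)
        output.append(token)
    return output


def decode_full_prefix(start_token, steps):
    """Week 2/3 style loop: pass the whole sequence generated so far,
    keeping no state between iterations."""
    output = [start_token]
    for _ in range(steps):
        state = 0.0
        for token in output:             # re-read the full prefix every time
            state = rnn_step(state, token)
        output.append(pick_next_token(state))
    return output


if __name__ == "__main__":
    print(decode_incremental(1, 8))
    print(decode_full_prefix(1, 8))
```

Both functions produce the same token sequence, because the carried state already encodes the full history. The Transformer decoder in the Week 2 and 3 loops keeps no such state between calls, so it has to be handed the entire output so far at each step.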

Kathy83 · July 14, 2024, 3:50pm

Hi @Anna_Kay, thanks a lot for your explanation. I finally understand it!
