All previously generated tokens as decoder input or only the latest generated token as decoder input

Hi there, I have finished the assignments of Course 4, Weeks 1/2/3, but I have a question about the decoder input. In the Week 1 assignment (NMT), the function ‘translate()’ produces the translation in a for loop as follows. It feeds only the single latest generated token into each step.

   # Iterate for max_length iterations
   for i in range(max_length):
        # Generate the next token
        next_token, logit, state, done = generate_next_token(model.decoder,context,next_token,...)

In the Week 2 assignment (Summarization), the function ‘summarize()’ builds the summary in a similar for loop, but it takes all previously generated tokens as decoder input:

   output = tf.expand_dims([tokenizer.word_index["[SOS]"]], 0)
   for i in range(decoder_maxlen):
        predicted_id = next_word(model, encoder_input, output)
        output = tf.concat([output, predicted_id], axis=-1)

The Week 3 assignment (Question Answering) does exactly the same as the Week 2 assignment. Given these differences, why doesn’t the first assignment concatenate all previously generated tokens as decoder input to generate the next token? In translation, shouldn’t the generation of the next word depend on all of the previously generated words? Can someone help me understand this? Thank you!

Hello @Kathy83!

In short, although all three assignments use an encoder-decoder approach, the Week 1 assignment uses an LSTM (RNN) based encoder-decoder model, while Weeks 2 & 3 use a Transformer based encoder-decoder model.

This difference in architecture is the reason behind what you are asking about. The two code snippets you provided differ not because they tackle different tasks (translation and summarization, respectively), but because the first is an LSTM approach while the second is a Transformer approach.

Also, your intuition regarding the question "In translation, shouldn’t the generation of the next word depend on all of the previously generated words?" is correct. Translation, like summarization, does benefit from the information provided by all the previously generated tokens, and the Transformer architecture uses it efficiently by processing multiple tokens simultaneously with self-attention.
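
To make the Transformer side concrete, here is a minimal sketch in plain Python. It is not the assignment's code: `next_token_stateless` is a hypothetical stand-in for a decoder call that keeps no memory between calls, and its scoring rule is made up purely for illustration (it is not attention). The point is only that such a function can see earlier tokens only if the whole prefix is passed in again, which is why the loop keeps growing `output`.

```python
def next_token_stateless(prefix, vocab_size=10):
    """Toy stand-in for a stateless decoder call (hypothetical, not attention).

    The result is a pure function of the whole prefix; nothing is remembered
    between calls.
    """
    # Position-weighted sum over every token in the prefix (made-up rule).
    score = sum((pos + 1) * tok for pos, tok in enumerate(prefix))
    return score % vocab_size


def generate(start_token, steps):
    """Grow the output one token at a time, re-feeding the full prefix."""
    output = [start_token]
    for _ in range(steps):
        # The whole sequence so far is the decoder input at every step.
        output.append(next_token_stateless(output))
    return output


full_context = next_token_stateless([1, 1, 3])  # sees all previous tokens
last_only = next_token_stateless([3])           # sees only the latest token
print(generate(1, 4), full_context, last_only)
```

With only the latest token as input, this toy function gives a different prediction than with the full prefix, because it has nowhere else to keep the earlier context.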

In LSTMs, the information from all previously generated tokens is not discarded. It is incorporated into the hidden state (the state variable in your first snippet) and processed sequentially, one token per step, so feeding in only the latest token is enough.
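
As a rough illustration of how a recurrent state carries the history, here is a toy sketch in plain Python. It uses a single scalar tanh cell rather than a real LSTM, and the names `rnn_step` and `run_stepwise` and the weights are assumptions made up for this example, not anything from the assignment.

```python
import math


def rnn_step(state, token_id, w_state=0.5, w_input=0.1):
    """One step of a toy scalar recurrent cell (hypothetical weights)."""
    # The new state mixes the previous state with the current token only.
    return math.tanh(w_state * state + w_input * token_id)


def run_stepwise(token_ids, initial_state=0.0):
    """Feed tokens one at a time, carrying the state forward."""
    state = initial_state
    for token_id in token_ids:
        state = rnn_step(state, token_id)
    return state


# Both sequences end with the same latest token (2) but differ earlier.
state_a = run_stepwise([3, 7, 2])
state_b = run_stepwise([5, 1, 2])
print(state_a != state_b)

# Resuming from a saved state with just the latest token gives the same
# result as processing the whole sequence from the start.
resumed = run_stepwise([2], initial_state=run_stepwise([3, 7]))
print(resumed == state_a)
```

Even though each step receives only one token, the state after the last step differs when the earlier tokens differ, so the history is still influencing the next prediction.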

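The Transformer approach has proved more efficient than the LSTM approach, largely because it can process all positions of a sequence in parallel during training instead of one step at a time.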


Hi @Anna_Kay, thanks a lot for your explanation, I finally understand it!
