Model arch: seq to seq or seq to 1?

Hi, I found that the shape of the model's output is (batch_size, max_time_steps, vocab_size), and that the input sequence is the shifted target sequence. So I thought the model architecture should look like this:


In that case, the cross-entropy loss should be calculated as an average over the time steps.
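To make the "shifted target sequence" concrete, here is a minimal NumPy sketch. It assumes the shift is one position to the right with a start token id of 0; `shift_right` here is a hypothetical helper written for illustration, not the notebook's own layer.

```python
import numpy as np

def shift_right(targets, start_id=0):
    """Prepend start_id to each row and drop the last token (teacher forcing)."""
    targets = np.asarray(targets)
    start = np.full((targets.shape[0], 1), start_id, dtype=targets.dtype)
    return np.concatenate([start, targets[:, :-1]], axis=1)

# Hypothetical token ids for a batch of one sequence.
targets = np.array([[5, 7, 9, 1]])
inputs = shift_right(targets)  # [[0, 5, 7, 9]]
```

With this layout, the model output at position t is compared against `targets[:, t]`, so every time step has its own target token.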

But the picture in the notebook is:


I'm confused by it because it makes me think the target is a single value rather than a sequence.

The loss of a model can be calculated either as an average over all time steps or only at the last step, depending on the specific implementation and the task at hand. Averaging over time steps is used, for example, in sequence prediction, language modeling, and machine translation.
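As a rough sketch of the per-step case, the snippet below averages the cross-entropy over all non-padding positions of a (batch_size, max_time_steps, vocab_size) output. The function names and the padding id of 0 are assumptions made for the example; the notebook's actual loss layer may weight positions differently.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_cross_entropy(logits, targets, pad_id=0):
    """logits: (batch, time, vocab); targets: (batch, time) integer ids."""
    log_probs = log_softmax(logits)
    # Log-probability assigned to the target token at every position.
    token_ll = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    # Assumption: positions equal to pad_id do not contribute to the loss.
    mask = (targets != pad_id).astype(log_probs.dtype)
    return -(token_ll * mask).sum() / mask.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 6))               # made-up model output
targets = np.array([[3, 1, 4, 0], [2, 5, 0, 0]])  # made-up targets, 0 = padding
loss = sequence_cross_entropy(logits, targets)
```

Every non-padding time step contributes one term, which is what "average over time steps" means here.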

However, there are scenarios where the loss is calculated only at the last step of the sequence. This typically occurs when the task requires a single prediction or classification based on the entire sequence, as in sentiment analysis or document classification.
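For contrast, a last-step loss could be sketched as follows. This assumes unpadded sequences, so the final time step is the last real one, and `last_step_cross_entropy` is a hypothetical name used only for this example.

```python
import numpy as np

def last_step_cross_entropy(logits, labels):
    """logits: (batch, time, num_classes); labels: (batch,) integer class ids."""
    last = logits[:, -1, :]  # keep only the final time step
    shifted = last - last.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-probability of the correct class for each example.
    picked = log_probs[np.arange(len(labels)), labels]
    return -picked.mean()

logits = np.zeros((3, 5, 2))   # made-up output: 3 sequences, 5 steps, 2 classes
labels = np.array([0, 1, 1])   # one label per sequence
loss = last_step_cross_entropy(logits, labels)
```

Here the earlier time steps only influence the loss through the model's internal state; their outputs are never compared against a target.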

Hi @guanwei_hu1

You are correct.

I drew some rectangles over your picture:
image

  • red rectangle - the output of the model; the loss is calculated over every prediction;
  • green rectangle - the prediction used for next_symbol;
  • blue rectangle - basically the same picture as in the notebook.

In other words, you are correct, and it depends on how you interpret the picture in the notebook (it shows how we predict the next token).
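A minimal sketch of the difference, assuming the model returns one prediction per position and that position t predicts token t from the tokens before it. `greedy_next_token` is a hypothetical helper, not the notebook's next_symbol function.

```python
import numpy as np

def greedy_next_token(logits, num_tokens_so_far):
    """logits: (1, max_time_steps, vocab_size) from a single forward pass.

    Training uses every position of logits for the loss; at inference
    only the row at index num_tokens_so_far is needed to pick the next token.
    """
    return int(np.argmax(logits[0, num_tokens_so_far, :]))

# Made-up output: 4 time steps, vocabulary of 5 tokens.
logits = np.zeros((1, 4, 5))
logits[0, 2, 3] = 1.0  # at position 2, token 3 has the highest score
token = greedy_next_token(logits, num_tokens_so_far=2)
```

So the same (batch_size, max_time_steps, vocab_size) output serves both views: the whole tensor for the loss, a single slice for the next token.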

Cheers