Hi, I found that the shape of the model's output is (batch_size, max_time_steps, vocab_size), and the input sequence is the shifted target sequence. So I thought the model arch should be like
The loss can be computed either as an average over all time steps or only at the last step, depending on the implementation and the task. Averaging over time steps is typical for sequence prediction, language modeling, and machine translation, where every step has its own target token.
However, in some scenarios the loss is calculated only at the last step of the sequence. This typically happens when the task requires a single prediction or classification based on the entire sequence, for example sentiment analysis or document classification.
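To make the difference concrete, here is a minimal sketch in plain Python (no framework) that computes both variants from logits shaped (batch_size, max_time_steps, vocab_size). The helper names `cross_entropy`, `per_step_average_loss`, and `last_step_loss` are made up for illustration and are not from any library. It also assumes there is no padding; with padded sequences you would normally mask out the padded steps before averaging.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def per_step_average_loss(logits, targets):
    """Average the loss over every time step of every sequence.

    logits:  nested list shaped (batch_size, max_time_steps, vocab_size)
    targets: nested list shaped (batch_size, max_time_steps) of token ids
    Assumption: no padding, so every step counts equally.
    """
    losses = [
        cross_entropy(step_logits, step_target)
        for seq_logits, seq_targets in zip(logits, targets)
        for step_logits, step_target in zip(seq_logits, seq_targets)
    ]
    return sum(losses) / len(losses)

def last_step_loss(logits, labels):
    """Use only the final time step of each sequence (one label per sequence)."""
    losses = [
        cross_entropy(seq_logits[-1], label)
        for seq_logits, label in zip(logits, labels)
    ]
    return sum(losses) / len(losses)

# Toy batch: 1 sequence, 2 time steps, 3 output classes/tokens.
logits = [[[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
print(per_step_average_loss(logits, [[0, 1]]))  # one target per time step
print(last_step_loss(logits, [1]))              # one label per sequence
```

In the per-step case the targets have one entry per time step, which matches the shifted-target setup from the question. In the last-step case there is only one label per sequence, so the last dimension would usually be the number of classes rather than a vocabulary.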