How is the loss of a Causal Language Model calculated?

Week 1
I now understand that a transformer does “next-token prediction” during training. What I want to know is: when calculating the loss during training, if the model’s output token sequence is longer or shorter than the ground-truth token sequence, how is this handled? Do we truncate the output or the labels to align their lengths before computing the cross-entropy, or do we pad them to the same length?

Yes. All inputs and labels (the labels are normally provided with the same length as the inputs, but the same processing applies) are made the same length during the preprocessing stage, either by padding or by truncating them, so all labels and outputs end up aligned. Also note that during training the model does not generate a free-length sequence: it produces exactly one prediction per input position, so the output logits automatically have the same length as the (padded or truncated) labels, and the padded positions are simply ignored when the cross-entropy is computed.
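To make that concrete, here is a minimal sketch in plain PyTorch (not the course's or any specific library's exact code) of how this loss is typically computed: the labels are just the input ids shifted by one position, and padded label positions are marked with an ignore index so they contribute nothing to the loss. The names `causal_lm_loss` and `IGNORE_INDEX` are illustrative, not taken from the course materials.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # padded label positions are masked out of the loss


def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """
    logits: (batch, seq_len, vocab_size) -- one prediction per input position
    labels: (batch, seq_len)             -- the input ids, with padding
                                            positions set to IGNORE_INDEX
    """
    # Shift so that the prediction at position t is scored against token t+1
    # (next-token prediction).
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    # Cross-entropy averaged over all non-padded positions.
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )


# Example: a batch of 2 sequences padded to length 5.
vocab_size = 10
logits = torch.randn(2, 5, vocab_size)
labels = torch.tensor([[3, 7, 2, 1, -100],       # last position is padding
                       [5, 5, 9, -100, -100]])   # last two positions are padding
print(causal_lm_loss(logits, labels))
```

Because the logits and the labels come from the same padded/truncated sequence, their lengths always match, and no extra alignment between “output” and “ground truth” is needed at loss time.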
