A question about lecture: Language Model and Sequence Generation


I have a question about the lecture "Language Model and Sequence Generation", at 10:19, where the loss function for a time step has a subscript 'i'. See the screenshot below. My first guess was that 'i' sums over all the training examples, although the training-example index was defined as a superscript, not a subscript, in an earlier lecture. A more likely guess is that it sums over all 10,003 probabilities, since the predicted y is a softmax output with 10,003 values. Any explanation is welcome.

Hi Ye,

Thanks for your question. Yes, I would say so too. From that week's assignment:

𝑦̂ is a 3D tensor of shape (𝑛𝑦,𝑚,𝑇𝑦)

  • 𝑛𝑦 : number of units in the vector representing the prediction
  • 𝑚: number of examples in a mini-batch
  • 𝑇𝑦: number of time steps in the prediction

So in the assignment you also need to compute the loss for all the examples in the mini-batch, in addition to the sum over the vocabulary inside the softmax.
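As a quick illustration, here is a NumPy sketch of the two sums involved (this is not the assignment's actual code; the shapes, names, and batch size are assumptions). At a single time step, the inner sum over 'i' runs over the 10,003 vocabulary entries of the softmax output, and the loss is then also summed over the m examples in the batch:

```python
import numpy as np

# Assumed shapes at one time step, following the assignment's convention:
# y_hat has shape (n_y, m), y is one-hot with the same shape.
n_y, m = 10003, 4  # vocabulary size, mini-batch size (example values)

rng = np.random.default_rng(0)
logits = rng.normal(size=(n_y, m))
# Softmax over the vocabulary axis, so each column sums to 1
y_hat = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

labels = rng.integers(0, n_y, size=m)  # true word index per example
y = np.zeros((n_y, m))
y[labels, np.arange(m)] = 1.0          # one-hot targets

# Inner sum over i (the vocabulary): only the true word's term is nonzero.
loss_per_example = -(y * np.log(y_hat)).sum(axis=0)
# Outer sum over the m examples in the mini-batch.
batch_loss = loss_per_example.sum()
```

Because the targets are one-hot, the sum over 'i' collapses to minus the log-probability the model assigns to the true word, which is why the cross-entropy at each time step depends only on that single softmax entry.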

Best and happy learning,