Hi,
I have a question about the lecture "Language Model and Sequence Generation". At 10:19, the loss function for a single time step has a subscript ‘i’ (see the screenshot below). My first guess is that ‘i’ sums over all the training examples, but the index is a subscript, whereas the earlier lectures defined a superscript for training examples. A more likely guess is that it sums over all 10,003 probabilities, because the predicted y is a softmax output with 10,003 values. Any explanation is welcome.
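
To make the second guess concrete, here is a minimal sketch of what I think it would mean. It assumes the slide shows the usual softmax cross-entropy, with ‘i’ running over vocabulary entries. That is my reading, not something I have confirmed from the lecture. The 5-word vocabulary and the name `step_loss` are made up for illustration; the real vector would have 10,003 entries.

```python
import math

# Hypothetical toy vocabulary of 5 words, standing in for the 10,003 entries.
def step_loss(y_hat, y_true):
    """Loss for ONE time step: -sum over i of y_true[i] * log(y_hat[i]).

    Here i indexes vocabulary entries, not training examples.
    Terms with y_true[i] == 0 contribute nothing, so they are skipped.
    """
    return -sum(y * math.log(p) for y, p in zip(y_true, y_hat) if y > 0)

y_hat = [0.1, 0.6, 0.1, 0.1, 0.1]   # softmax output (sums to 1)
y_true = [0, 1, 0, 0, 0]            # one-hot target: the true word is index 1

loss = step_loss(y_hat, y_true)
print(loss)  # equals -log(0.6), since only the true word's term survives
```

If this reading is right, the sum over ‘i’ collapses to a single term for a one-hot target, and the sum over time steps (and over training examples) would be separate, outer sums.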