Sorry, but your analysis is incorrect. The formula asks for the sum over the max values, which is what the second set of code computes. The first set of code computes the max of the sums, which is not the same thing. The fact that you get the correct answer with the second set of code is at least suggestive, one would think. These courses were last updated in any major way in April of 2021. If the test code were incorrect, we would know about it by now.
I am grateful for your reply. I just realized that I misinterpreted equation (3). I apologize for raising a naive concern without thoroughly investigating it.
Yes, the point is that the loss is always the sum of the losses for the individual training samples. It is only at the level of the values for the individual samples that we need to take the max, to ensure that each of those values is non-negative. So it should be clear from looking at the formula that the sum is the “outer” operation.
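To make the difference concrete, here is a small sketch in plain Python. The per-sample values are made up for illustration and are not taken from the assignment or its test cases, and the variable names are hypothetical. It only shows that taking the max per sample and then summing can give a different result from summing first and taking the max once.

```python
# Hypothetical per-sample values before the max is applied (made up for illustration).
per_sample = [0.5, -2.0, 1.5, -3.0]

# Sum over the max values: clip each sample at 0 first, then add them up.
# The sum is the outer operation here.
sum_of_max = sum(max(v, 0.0) for v in per_sample)

# Max applied after summing: add everything up first, then clip once.
# Negative samples cancel positive ones before the max is ever applied.
max_of_sum = max(sum(per_sample), 0.0)

print("sum of max:", sum_of_max)  # 2.0
print("max of sum:", max_of_sum)  # 0.0
```

With these numbers, the two negative samples wipe out the positive ones in the second version, so the two orderings disagree.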
Glad to hear that it makes sense now. If it’s any comfort, you are far from the first person to step on that landmine.
I agree with your explanation about the loss function. I overlooked an important detail by not checking it thoroughly. Thanks again for your input on my concern and for sharing your perspective in your last comment.