Actually, now that I think an \epsilon harder about what David is saying there, the relationship between L and J can include additional terms in some cases, e.g. the L2 penalty term when we are doing L2 regularization. That term is added to the mean of the L values to get the final J value that is used for computing gradients in that case (see the formula sketched below). There are other forms of regularization that add different additional terms, e.g. L1 or “Lasso” regularization, which adds a term based on the sum of the absolute values of the weights. But the “base” unregularized cost J is just the mean of L over the samples.
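To make that concrete, here is the L2-regularized cost roughly as Prof Ng writes it in DLS C2 (the first term is the unregularized mean of the per-sample losses, the second is the added penalty over the weight matrices; I am quoting this from memory, so take the exact form with a grain of salt):

J = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l} \| W^{[l]} \|_F^2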
One other side question worth mentioning: people frequently ask why the L2 regularization term is scaled by \frac{1}{m}. That makes it look a bit like an average, but of course the sum there is over the weights, not over the samples. I don’t know the answer for sure, and Prof Ng does not discuss this in the DLS C2 lectures (at least that I can recall), but one theory is that the purpose is to make the value of the hyperparameter \lambda orthogonal to the dataset size, so that a \lambda tuned on one dataset size still behaves reasonably when m changes. Here’s a thread which discusses that a bit more. And here’s a thread in which @conscell points out that Prof Ng does say more about this in the MLS lectures and confirms the “hyperparameter orthogonality” motivation for doing the scaling that way.
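Here is a minimal NumPy sketch of that cost computation, just to illustrate where the \frac{\lambda}{2m} scaling sits (the function name, the random weight matrices and the stand-in per-sample losses are purely hypothetical, not anything from the course assignments):

```python
import numpy as np

def l2_regularized_cost(per_sample_losses, weights, lam):
    """Mean of the per-sample L values plus the L2 penalty scaled by lambda / (2 * m)."""
    m = per_sample_losses.shape[0]
    base_cost = np.mean(per_sample_losses)                            # unregularized J
    l2_term = (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)  # penalty over all W^[l]
    return base_cost + l2_term

# Illustration of the "hyperparameter orthogonality" idea: because the penalty
# is divided by m, its contribution shrinks as the dataset grows, so the same
# lambda applies proportionally less regularization when you have more data.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((1, 4))]
for m in (100, 1000):
    losses = rng.uniform(0.2, 0.8, size=m)  # stand-in per-sample L values
    print(m, l2_regularized_cost(losses, weights, lam=0.7))
```

One plausible reading of that behavior is that with more data you generally need less regularization anyway, so building the \frac{1}{m} factor into the penalty means a tuned \lambda transfers better across dataset sizes, which is consistent with what the threads linked above say.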