The formula you show is the cross entropy loss function for binary (yes/no) classification, so of course it has only two terms: the first is the loss for the y = 1 case and the second is the loss for the y = 0 case.
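To make that concrete, here is a minimal sketch of binary cross entropy in NumPy (the function name and the example values are just illustrations, not from the course code). Notice how the label y zeroes out one of the two terms for every sample:

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # For each sample, exactly one of the two terms is nonzero:
    # y = 1 keeps the first term, y = 0 keeps the second.
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])       # true labels
y_hat = np.array([0.9, 0.2, 0.8])   # sigmoid outputs (predicted P(y = 1))
loss = binary_cross_entropy(y, y_hat)
```

For the sample with y = 1 and y_hat = 0.9, the contribution is just -log(0.9); the (1 - y) term vanishes.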

When you generalize to a multiclass problem with more than two possible output classes, the output activation switches from sigmoid to softmax, which gives you a probability distribution across the output classes for any given sample. The loss function is then just the generalization of the binary case: it is still the same cross entropy calculation, but only one term (the one selected by the y label value) is nonzero for each sample. Viewed that way, it is exactly the same formula as in the binary case.
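A small sketch of that generalization (again, illustrative names, not the course's actual code). With a one-hot label, the sum in the cross entropy collapses to a single term, just as described above:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_onehot, probs):
    # Only the term for the true class survives,
    # because the one-hot label zeroes out all the others.
    return -np.sum(y_onehot * np.log(probs))

z = np.array([2.0, 1.0, 0.1])    # raw logits for 3 classes
probs = softmax(z)               # probability distribution over classes
y = np.array([1.0, 0.0, 0.0])    # one-hot label: true class is class 0
loss = cross_entropy(y, probs)   # equals -np.log(probs[0])
```

So the multiclass loss for each sample is just -log of the probability the network assigned to the correct class, which is the binary formula again once you notice that only one term is selected.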

I don't remember how much detail Prof Ng goes into on this in the lectures, so it's worth scanning through them again. There is also a nice lecture on YouTube by Prof Geoff Hinton covering softmax and the cross entropy loss function.

Or, if your question is the more basic one of why the logarithm is used there: it comes from "maximum likelihood estimation" in statistics. Here's a thread from mentor Raymond that gives a nice intuitive explanation with examples. And here's a thread that shows the graph of log between 0 and 1.
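You can also just print a few values to see the behavior that graph shows. The key intuition: -log(p) is 0 when the model assigns probability 1 to the correct answer, and grows without bound as that probability approaches 0, so confident wrong answers are punished very heavily:

```python
import numpy as np

# -log(p) on (0, 1]: small loss for confident correct predictions,
# exploding loss as the assigned probability approaches 0.
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p = {p:>4}: -log(p) = {-np.log(p):.3f}")
```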