Explanation of Logistic Regression Cost Function

Hi!

I have a question regarding Professor Ng’s video, “Explanation of Logistic Regression Cost Function (Optional)” in week 2. When he notes that we are trying to maximise p(labels in training set), I just want to make sure that this means we are trying to maximise the product, over all training examples, of the probabilities that our classifier assigns to each example's given y label?

If so, is there an intuitive understanding of what this p(labels in training set) actually means? I understand that a higher value would imply better classification performance, but in terms of probability can it be understood as sort of the probability that our classifier would label EVERY example correctly (since the product of the probabilities of A and B is the probability of A and B, assuming independence)?

Thank you!

Hi @kzed and welcome to the DL specialization! Yes, I believe that is the correct interpretation. As you followed the video, you may have realized that parameters of the model are chosen according to the maximum likelihood principle. (With the proviso that we prefer to minimize a cost function, so in the end it is the log of the likelihood, multiplied by minus one, that we minimize.) And the likelihood function is obtained by multiplying the individual Bernoulli distributions,

p(y|x)=\hat{y}^y\left(1-\hat{y}\right)^{1-y},

because the individual “trials” are independent. Then, since it’s nicer to work with sums than with products, we apply a natural log transformation to get the log-likelihood function.
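To make the product-versus-sum point concrete, here is a minimal NumPy sketch. The predictions and labels are made-up numbers for illustration, not anything from the course, and the variable names are my own.

```python
import numpy as np

# Hypothetical predictions y_hat = P(y = 1 | x) and true labels for 4 examples.
y_hat = np.array([0.9, 0.2, 0.7, 0.6])
y = np.array([1, 0, 1, 0])

# Per-example Bernoulli probability of the observed label:
# y_hat when y == 1, (1 - y_hat) when y == 0.
p_each = y_hat**y * (1 - y_hat)**(1 - y)

# Likelihood of the whole label set, assuming independent examples.
likelihood = np.prod(p_each)

# Log-likelihood: the log turns the product into a sum.
log_likelihood = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Negative average log-likelihood, i.e. the quantity we would minimize.
cost = -log_likelihood / len(y)

print(likelihood, log_likelihood, cost)
```

Here `np.log(likelihood)` and `log_likelihood` agree up to floating-point error, which is the whole point of the log transformation: same maximizer, but a sum that is easier to differentiate and numerically safer than a long product of small numbers.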

Here, we think of \hat{y} as the probability of a “positive” outcome (it’s a cat) given x, the evidence offered by the image. You can also think of this as a probability model for a weighted coin toss where \hat{y} is the probability that a single toss turns up ‘heads’. In the coin-toss case, we would not have the conditioning information x. The coin is the coin. In binary classification the information that the image contains two pointed ears tilts the probability in favor of ‘cat’.
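For the coin-toss version, a rough sketch of maximum likelihood might look like the following. The toss record and the grid of candidate probabilities are assumptions chosen purely for illustration.

```python
import numpy as np

# Hypothetical record of 10 tosses (1 = heads, 0 = tails): 7 heads.
tosses = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

# Candidate values for p = P(heads), avoiding 0 and 1 so the log stays finite.
p_grid = np.linspace(0.01, 0.99, 99)

heads = tosses.sum()
tails = len(tosses) - heads

# Bernoulli log-likelihood of the whole record for each candidate p.
log_lik = heads * np.log(p_grid) + tails * np.log(1 - p_grid)

# The candidate that makes the observed tosses most likely.
p_best = p_grid[np.argmax(log_lik)]
print(p_best, tosses.mean())
```

The maximizing value lands on the observed fraction of heads. Logistic regression does the same thing, except that the probability is allowed to depend on x through the model parameters.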

I think that the essence of your question(s) revolves around the likelihood principle: that we choose the model parameters (via gradient descent) that make the data (the training examples) the most likely to have been observed under the probability model.
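As a small illustration of that principle, the sketch below runs plain gradient descent on a one-feature logistic regression and records the log-likelihood of the training labels at each step. The toy data, learning rate, and iteration count are all assumptions of mine, not values from the course.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, b, X, y):
    # Sum of log Bernoulli probabilities of the observed labels.
    y_hat = sigmoid(X @ w + b)
    return np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical toy data: one feature, 6 examples, not perfectly separable.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 1, 0, 1, 1])

w = np.zeros(1)
b = 0.0
lr = 0.1  # assumed learning rate
history = []

for _ in range(200):
    history.append(log_likelihood(w, b, X, y))
    y_hat = sigmoid(X @ w + b)
    # Gradient of the cost (negative mean log-likelihood) w.r.t. w and b.
    dw = X.T @ (y_hat - y) / len(y)
    db = np.mean(y_hat - y)
    w -= lr * dw
    b -= lr * db

history.append(log_likelihood(w, b, X, y))
print(history[0], history[-1])
```

Each step that lowers the cost raises the log-likelihood, so the parameters drift toward values under which the observed labels are more probable.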

I hope this helps. If not, give another shout! :slight_smile:

Hi @kenb ,

Thank you very much for the answer!

Best,
Kevin