Explanation of Logistic Regression Cost Function

kzed · June 19, 2021, 1:53pm

Hi!

I have a question regarding Professor Ng’s video, “Explanation of Logistic Regression Cost Function (Optional)” in week 2. When he notes that we are trying to maximise p(labels in training set), I just want to make sure that this means we are trying to maximise the product, over all training examples, of the probability that our classifier assigns to each example’s given y label?

If so, is there an intuitive understanding of what this p(labels in training set) actually means? I understand that a higher value would imply better classification performance, but in terms of probability can it be understood as sort of the probability that our classifier would label EVERY example correctly (since products of probabilities for A, B is probability of A and B assuming independence)?

Thank you!

kenb · June 21, 2021, 1:50pm

Hi @kzed and welcome to the DL specialization! Yes, I believe that is the correct interpretation. As you followed the video, you may have realized that the parameters of the model are chosen according to the maximum likelihood principle. (With the proviso that we prefer to minimize a cost function, so the log-likelihood function is multiplied by minus one.) And the likelihood function is obtained by multiplying the individual Bernoulli distributions,

p(y|x)=\hat{y}^y\left(1-\hat{y}\right)^{1-y},

because the individual “trials” are independent. Then, since it’s nicer to work with sums than with products, we apply a natural log transformation to get the log-likelihood function.
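To make the product-to-sum step concrete, here is a minimal NumPy sketch. The labels and predicted probabilities are made-up illustrative values, not anything from the course; the point is only that the log of the product of the per-example Bernoulli probabilities equals the sum in the log-likelihood, and that the cost is its negative average.

```python
import numpy as np

# Hypothetical labels and predicted probabilities (illustrative values only).
y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])

# Per-example Bernoulli probability: y_hat when y == 1, (1 - y_hat) when y == 0.
p_each = y_hat**y * (1 - y_hat)**(1 - y)

# Likelihood of the whole label set, assuming the examples are independent.
likelihood = np.prod(p_each)

# Log-likelihood: taking the log turns the product into a sum.
log_likelihood = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Cost: the negative log-likelihood, averaged over the m examples.
cost = -log_likelihood / len(y)

print(likelihood, log_likelihood, cost)
```

Because the log is monotonic, maximising the likelihood and minimising this cost pick out the same parameters.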

Here, we think of \hat{y} as the probability of a “positive” outcome (it’s a cat) given x, the evidence offered by the image. You can also think of this as a probability model for a weighted coin toss where \hat{y} is the probability that a single toss turns up ‘heads’. In the coin-toss case, we would not have the conditioning information x. The coin is the coin. In binary classification, the information that the image contains two pointed ears tilts the probability in favor of ‘cat’.
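The coin-toss version can be sketched in a few lines. The toss sequence below is a made-up example, and the grid search is just a simple stand-in for an optimizer; under those assumptions, the heads probability that maximises the log-likelihood should land on the observed fraction of heads.

```python
import numpy as np

# Hypothetical sequence of tosses: 1 = heads, 0 = tails (7 heads out of 10).
tosses = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

# Candidate values for the heads probability (a coarse grid, for illustration).
p_grid = np.linspace(0.01, 0.99, 99)

# Bernoulli log-likelihood of the whole sequence for each candidate p.
heads = tosses.sum()
tails = len(tosses) - heads
log_lik = heads * np.log(p_grid) + tails * np.log(1 - p_grid)

# The maximum-likelihood estimate is the candidate with the largest log-likelihood.
p_best = p_grid[np.argmax(log_lik)]

print(p_best, tosses.mean())
```

In logistic regression the same idea applies, except that the heads probability is no longer one fixed number: it is \hat{y}, which changes with the input x.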

I think that the essence of your question(s) revolves around the likelihood principle: that we choose model parameters (via gradient descent) that make the data (the training examples) the most likely to have been observed given the probability model.
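As a rough sketch of that principle in action, the snippet below runs plain gradient descent on the averaged negative log-likelihood for a one-feature logistic regression. The toy data, the learning rate, and the step count are all assumptions chosen for illustration, not values from the course.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-D toy data (illustrative values only, deliberately not separable).
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 1, 0, 1, 1])

def log_likelihood(w, b):
    """Log-probability the model assigns to the observed labels."""
    y_hat = sigmoid(w * X + b)
    return np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

w, b = 0.0, 0.0
learning_rate = 0.1  # assumed value, not tuned
history = []

for _ in range(200):
    y_hat = sigmoid(w * X + b)
    # Gradients of the averaged negative log-likelihood (the cost).
    grad_w = np.mean((y_hat - y) * X)
    grad_b = np.mean(y_hat - y)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    history.append(log_likelihood(w, b))

print(history[0], history[-1])
```

Each step that lowers the cost raises the log-likelihood, so the fitted parameters make the observed labels more probable under the model than the starting parameters did.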

I hope this helps. If not, give another shout!

kzed · June 21, 2021, 3:34pm

Hi @kenb ,

Thank you very much for the answer!

Best,
Kevin
