Can someone explain how we arrived at the convex loss function? The squared error function makes sense, since it represents the difference between prediction and label and is thus analogous to a loss. However, because it is non-convex for logistic regression, we go for a logarithmic function instead. How was this obtained? Are there any derivations?
Hey there! The convex loss function, owing to having a single global minimum, makes it easy to know when we have reached it. As explained in the lecture, a function with several local minima would not be ideal, since our model might not be able to find the lowest possible cost, and finding an accurate model would thus be tough. Scouring through the net, I found some mathematical explanation as you asked here. Enjoy learning, and hopefully I answered some of the questions in your mind.
P.S. I am also new to the course, so sorry if I made a mistake somewhere; hopefully a mentor will correct it.
That is not what I asked; however, thank you for trying.
I will try reiterating what I am trying to understand:
As we try to find the critical point we perform gradient ascent, and at some point we reach a maximum; given the function is logarithmic (concave), we end up with a global maximum. However, since we are trying to find a minimum, we convert it into a minimization problem. The cost function is obtained from maximum likelihood estimation (MLE) applied to the loss function. I have not tried it on the loss function myself, but my guess is: since the product of two convex functions is not guaranteed to be convex, that is possibly why the cost function is no longer convex once it goes through multiple layers.
Kindly indulge my silly questions:
1. Why does maximum likelihood estimation give the cost function? (Given we assume the data are IID, the likelihood is the product of the likelihoods of the individual data points, but what does that have to do with cost? Also, am I making any wrong assumptions if I consider the loss and cost functions as just “loss and cost” in order to see the relation clearly?)
2. Suppose my assumption that “since the product of two convex functions is not guaranteed to be convex, that is possibly why the cost function is not convex once it goes through multiple layers” is correct. We know that the log converts a product into a sum, that a sum of convex functions is always convex, and that the MLE cost is just a sample mean in the end, which only scales our function and should not affect the locations of maxima or minima. Shouldn’t that result in a convex function, contrary to what the professor said: “Once we go to multi-layer Neural Networks, the cost function will no longer be convex, even though it is the same function”? I know I am probably making a blunder in my assumptions somewhere; kindly point me in the right direction.
3. If “log loss is not convex in multi-layer networks” is true, how does it even converge? Shouldn’t there be a large possibility that it just ends up at some local optimum and does not converge any further?
Thank you.
I am going to sum up (and fill in a bit) the optional video “Explanation of the logistic regression cost function” so that we can both agree on some things, and then we can take it from there.
The key point for your first question is that ML estimation has very desirable properties from the perspective of statistical theory. To use it, we must start with a (parametric) probability model. Let’s call it p(y; \theta), where y is an observed set of data and \theta is a vector of parameters that help define the probability model. The method of maximum likelihood finds the value of \theta which is “most likely” to have produced the data. That is, to “learn” the “best” set of parameters, one must find those that maximize the likelihood function with respect to \theta given the data.
Note: In machine learning and AI, the convention is to minimize the negative of the likelihood function, where that negative is interpreted as the “cost function.” We want to make the cost of “poorly chosen” parameters high, i.e., we want to penalize them. The goal can be stated either as finding the parameters that maximize the likelihood function or as finding those that minimize the cost function.
In binary classification using logistic regression, the underlying probability model is the Bernoulli distribution:
p(y|x) = \hat{y}^y \left(1- \hat{y}\right)^{\left(1-y\right)} .
You should think of \hat{y}, the probability of “success” (it’s a cat, or y=1), as the parameter which we would like to estimate, or “learn”, given the data y (the labeled examples). Alternatively, I could have written the left-hand side as p(y|x; \theta), or more specific to the present application, p(y|x; w, b) to acknowledge the dependence on parameters.
As should be clear from the left-hand side, this is a conditional probability (read: “the probability of y conditional on x,” where x represents the data or “training examples”). That’s because \hat{y}=\sigma(w^Tx + b) is another piece of the probability model.
With your observation that the training data, the \left(x,y\right) pairs, are independently and identically distributed, we can form the likelihood function by multiplying the above expression for the Bernoulli distribution over all examples to obtain the joint probability distribution. That’s the likelihood function. Since it’s often easier to work with sums, we take the log of the resulting expression and turn the product into a sum. That’s the log-likelihood function, and it is a legitimate step: since the log function is strictly increasing, maximizing the log-likelihood with respect to the parameters is equivalent to maximizing the likelihood function. Or, in the preferred language of machine learning, we minimize the negative of the log-likelihood (i.e., the cost function).
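To make the product-to-sum step concrete, here is a minimal numerical sketch (toy data and parameter values made up for illustration, not from the course): the likelihood is the product of Bernoulli terms \hat{y}^y(1-\hat{y})^{1-y} over the examples, and the negative log of that product is exactly the familiar binary cross-entropy cost.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))          # 5 toy examples, 3 features
y = np.array([1, 0, 1, 1, 0])        # toy labels
w = rng.normal(size=3)               # arbitrary parameter values
b = 0.1

yhat = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # sigmoid(w^T x + b)

# Likelihood: product over IID examples of the Bernoulli pmf.
likelihood = np.prod(yhat**y * (1 - yhat)**(1 - y))

# Cost: negative log-likelihood, i.e. the sum of per-example log losses
# (binary cross-entropy, up to the 1/m averaging).
cost = -np.sum(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

# The two agree: -log(product of terms) == sum of -log(term).
assert np.isclose(cost, -np.log(likelihood))
```

Maximizing the product and minimizing the sum of negative logs therefore pick out exactly the same parameters.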
Key point: The binary cross-entropy cost function is derived from the principle of maximum likelihood.
Other observations:
- For the basic linear regression model, choosing the parameters w and b by minimizing the mean-squared error can also be derived from the likelihood principle, where the underlying probability model is Gaussian (based on the normality of the errors). The MSE cost function is convex and so has a unique global minimum.
- In the case of logistic regression, the binary cross-entropy cost function (derived from the likelihood principle and based on the Bernoulli distribution) is also convex.
- As Prof Ng states, we lose that convexity in a multi-layer net, so in principle, we need to be concerned about settling into a local optimum rather than the global one. Much, much more will be said about this in Course 2. So, hang on.
- I have not done this, but if you wanted to convince yourself of the above fact, at least you have the tools of the likelihood principle to help you think about it.
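In that spirit, here is one small self-contained way to see the loss of convexity numerically (my own hand-crafted construction, not from the lecture). In a one-hidden-layer network, swapping the two hidden units gives a distinct parameter vector that computes the identical function, so two different points in parameter space attain the same cost. If the cost were convex, the midpoint of those two points could cost no more than that shared value, yet it typically costs far more, because averaging collapses both hidden units onto the same weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(params, X, y):
    """Mean binary cross-entropy of a 2-2-1 sigmoid network."""
    W1, b1, W2, b2 = params
    h = sigmoid(X @ W1.T + b1)       # hidden layer activations
    yhat = sigmoid(h @ W2 + b2)      # output unit
    return -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

# XOR: the classic problem a single-layer model cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Solution A, chosen by hand: hidden unit 1 ~ OR, hidden unit 2 ~ AND,
# output ~ OR AND (NOT AND). Its cost is near zero.
W1 = np.array([[20.0, 20.0], [20.0, 20.0]])
b1 = np.array([-10.0, -30.0])
W2 = np.array([20.0, -40.0])
b2 = -10.0
sol_a = (W1, b1, W2, b2)

# Solution B: the same network with the two hidden units swapped.
perm = [1, 0]
sol_b = (W1[perm], b1[perm], W2[perm], b2)

# Midpoint of A and B: both hidden units collapse to identical weights.
mid = tuple((u + v) / 2 for u, v in zip(sol_a, sol_b))

# A and B compute the identical function, so their costs match...
assert np.isclose(cost(sol_a, X, y), cost(sol_b, X, y))
# ...but the midpoint is far worse. A convex function can never exceed
# the average of its values at two endpoints, so this cost is not convex.
assert cost(mid, X, y) > cost(sol_a, X, y) + 1.0
```

The same permutation symmetry exists in any multi-layer network, which is one simple reason the convexity of the single-layer cross-entropy cost cannot survive the addition of a hidden layer.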
Let me know if this helps. Cheers, @kenb