Why MSE is non-convex for Logistic regression

In the video on the logistic regression cost function, it is mentioned that using MSE as a loss function for logistic regression makes it non-convex. Can someone prove this (both mathematically and visually), or help me develop an intuition for it?

Explanations I find unconvincing:

Statistical ML theory: I understand the statistical ML argument that the loss function is the NLL of the model, and that we find the best parameters by minimizing the NLL. One can show that linear regression assumes a Gaussian error distribution while logistic regression assumes a Bernoulli distribution, so the NLL works out to MSE and cross entropy respectively.
But this still doesn’t answer the question of what makes MSE non-convex for logistic regression but not for linear regression.

Penalization theory: There is one more argument, the so-called “penalization theory”: cross entropy penalizes a wrong prediction by a very large amount (approaching infinity as the prediction becomes confidently wrong), whereas MSE penalizes a wrong prediction by at most 1. This gives the cross-entropy loss a much larger range.
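A quick numerical illustration of that penalization gap (my own toy numbers, not from the video): for a positive example with label y = 1, MSE is capped at 1, while cross entropy grows without bound as the prediction goes confidently wrong.

```python
import math

# Compare how MSE and cross entropy penalize an increasingly
# confident wrong prediction for a positive example (y = 1).
for y_hat in [0.1, 0.01, 1e-6]:
    mse = (y_hat - 1.0) ** 2      # bounded: can never exceed 1
    ce = -math.log(y_hat)         # unbounded: blows up as y_hat -> 0
    print(f"y_hat={y_hat:8.1e}  MSE={mse:.6f}  CE={ce:.4f}")
```

As `y_hat` shrinks toward 0, MSE creeps toward its ceiling of 1 while cross entropy keeps growing.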

All the above theories explain the rationale for using cross entropy rather than MSE in logistic regression.
However, my question is WHAT EXACTLY MAKES MSE NON-CONVEX FOR LOGISTIC REGRESSION. I read on the web that it is due to the non-linear nature of the sigmoid, which makes the loss function non-convex, but I am still not able to visualize that or develop an intuition for it.

Nor am I able to link the above theories to my question.

Can someone please explain this to me?

When I searched the web, I found this link: Squared Error vs Log Loss of Sigmoid

Okay, this helps me clearly see visually that MSE is non-convex. But in the video (Logistic Regression Cost Function) mentioned above, Andrew Ng said the problem is multiple local optima. How?
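One way to see the non-convexity numerically (a minimal sketch of my own, not from the linked post): compose the squared error with the sigmoid for a single example with label y = 1 and estimate the second derivative by finite differences. A convex function must have a non-negative second derivative everywhere; here the sign flips.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(z, y=1.0):
    # squared error of the sigmoid output against the label
    return (sigmoid(z) - y) ** 2

def second_derivative(f, z, h=1e-4):
    # central finite-difference estimate of f''(z)
    return (f(z + h) - 2.0 * f(z) + f(z - h)) / h ** 2

# The sign of the second derivative flips, so the squared error
# composed with the sigmoid is not convex in z.
print(second_derivative(mse_loss, -2.0))  # negative: concave region
print(second_derivative(mse_loss, 0.0))   # positive: convex region
```

Intuitively, the sigmoid flattens out in its tails, so the squared error inherits an S-shaped (inflected) profile rather than a bowl shape.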

Hello @mayankb2103 and welcome to the DL specialization. MSE loss is the natural choice for linear regression. Minimizing the average MSE loss (the ordinary least squares estimator) is a nice linear-quadratic problem that guarantees a unique (i.e. global) minimum. (It also coincides with the maximum-likelihood estimator if you assume Gaussian errors.) This is the first slide of the “Gradient Descent” lecture from week 2. Nice pictures there.

Consider the least squares loss function:

$$\mathcal{L}(\hat{y}, y) = \frac{1}{2}\,(\hat{y} - y)^2$$
To prove non-convexity of the MSE loss function with the logistic model, I would substitute

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^T x + b$$
into the quadratic loss function above, differentiate with respect to z, and set the result equal to zero. All you have to do is show that there is at least one other solution to that equation (checking that it’s a local minimum and not a local maximum). I have not done this; I hope you do! :slight_smile: That said, I am guessing that the affine form of z has no bearing on the proof. It might, in which case you would need to differentiate with respect to w and b and solve the higher-dimensional system for zero (the zero vector). Ouch.
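In the same spirit, here is a quick numerical check on a toy 1-D dataset (the two data points and the no-bias model are my own invented example, not from the lecture): the averaged MSE loss over the dataset fails the midpoint-convexity inequality in w, so it cannot be convex.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D dataset, chosen only for illustration: one positive and one
# negative example at different scales, model y_hat = sigmoid(w * x).
x = np.array([1.0, 3.0])
y = np.array([1.0, 0.0])

def mse(w):
    # average squared error of the logistic model over the dataset
    return np.mean((sigmoid(w * x) - y) ** 2)

# Midpoint convexity test: any convex L must satisfy
#   L((a + b) / 2) <= (L(a) + L(b)) / 2   for all a, b.
a, b = 1.0, 3.0
midpoint = mse((a + b) / 2.0)
chord = (mse(a) + mse(b)) / 2.0
print(midpoint > chord)  # True: the inequality fails, so L is non-convex
```

Plotting `mse(w)` over a range of w for datasets like this is a good way to see the bumps that gradient descent can get stuck on.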