Can anyone help me understand why the squared error cost function does not work well with gradient descent when using the sigmoid function in logistic regression, while the cross-entropy cost function does work well?
thanks
Reference: Week 2 lecture titled "Logistic Regression Cost Function"
Hi, @muneer321, welcome to the community!
While the squared error cost function works well for linear regression, it causes problems when combined with the sigmoid function in logistic regression. In logistic regression and neural networks, using squared error as the cost with a sigmoid activation makes the cost function non-convex: the loss surface can have multiple local minima and flat regions. Gradient descent then struggles to find the global optimum, since it moves very slowly across the flat regions and can get stuck in a local minimum, leading to slow convergence and suboptimal solutions. Because squared error doesn't give the optimizer a single, well-defined minimum to head toward, optimization becomes unreliable.
On the other hand, when cross-entropy is used with the sigmoid function in binary classification tasks such as logistic regression, it creates a convex loss surface. Thus, there is only one global minimum, which makes it easier for gradient descent to converge efficiently. In addition, logistic regression outputs probabilities between 0 and 1, and the cross-entropy loss directly models the distance between the predicted probability (sigmoid output) and the true label (either 0 or 1). It is a more natural fit because it penalizes misclassifications more effectively than squared error.
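To make that shape difference concrete, here is a minimal numerical sketch (the single training example with x = 1 and label y = 1, the absence of a bias term, and the grid of weight values are all assumed purely for illustration). It evaluates both costs for one example as the weight varies and checks whether the curve ever bends downward, which a convex function never does:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup: one example with x = 1 and y = 1, no bias, so y_hat = sigmoid(w).
w = np.linspace(-10, 10, 2001)
y_hat = sigmoid(w)
y = 1.0

mse = (y_hat - y) ** 2                                    # squared-error cost
ce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # cross-entropy cost

def bends_downward(cost):
    # A convex curve has a non-negative second difference everywhere.
    return bool(np.any(np.diff(cost, n=2) < -1e-10))

print("MSE bends downward somewhere (non-convex):", bends_downward(mse))  # True
print("Cross-entropy bends downward somewhere:   ", bends_downward(ce))   # False
```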
As explained in the lesson,
- When the true label is y = 1, cross-entropy encourages the prediction \hat{y} to be as close to 1 as possible by minimizing -\log(\hat{y}).
- When y = 0, it encourages \hat{y} to be as close to 0 as possible by minimizing -\log(1 - \hat{y}).
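Here is a tiny sketch of those two terms for a few prediction values (the specific \hat{y} values are just illustrative). Each term stays small when the prediction agrees with its label and blows up as the prediction moves toward the wrong end:

```python
import numpy as np

# Per-example cross-entropy terms from the lesson:
#   -log(y_hat)      is the loss when the true label is y = 1
#   -log(1 - y_hat)  is the loss when the true label is y = 0
for y_hat in [0.99, 0.9, 0.5, 0.1, 0.01]:
    loss_if_y_is_1 = -np.log(y_hat)
    loss_if_y_is_0 = -np.log(1 - y_hat)
    print(f"y_hat={y_hat:4.2f}  loss(y=1)={loss_if_y_is_1:6.3f}  loss(y=0)={loss_if_y_is_0:6.3f}")
```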
This structured gradient response gives cross-entropy an advantage: the cross-entropy loss increases rapidly as predictions deviate from the true label, providing strong gradient signals that allow for faster learning and error correction. With squared error, by contrast, when the model predicts probabilities close to 0 or 1 (i.e., the neuron is saturated), the cost function's gradient becomes very small, which makes it hard for the model to correct its mistakes and slows down learning.
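As a rough numerical sketch of that saturation effect (the label, the raw score, and the single-example setup are assumed for illustration), compare the gradient of each cost with respect to the raw score z when the model is confidently wrong:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One confidently wrong prediction: true label y = 1, but the raw score z is
# very negative, so y_hat = sigmoid(z) is close to 0 (the neuron is saturated).
y, z = 1.0, -6.0
y_hat = sigmoid(z)  # ~0.0025

# Gradient of each per-example cost with respect to z (chain rule through the sigmoid):
#   cross-entropy: dL/dz = y_hat - y                        (no sigmoid' factor)
#   squared error: dL/dz = 2 * (y_hat - y) * y_hat * (1 - y_hat)
grad_ce = y_hat - y
grad_mse = 2 * (y_hat - y) * y_hat * (1 - y_hat)

print(f"y_hat = {y_hat:.4f}")
print(f"cross-entropy gradient: {grad_ce:+.4f}     (large -> fast correction)")
print(f"squared-error gradient: {grad_mse:+.6f}   (tiny -> saturated neuron barely learns)")
```

The cross-entropy gradient stays close to -1, while the squared-error gradient carries an extra \hat{y}(1 - \hat{y}) factor that is nearly zero at saturation.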
I hope this helps!
The cost functions are different because the goals are different.
- Linear regression tries to make a model that fits the examples.
- Logistic regression tries to create a boundary between the “true” and “false” examples.
Here’s another thread from a while ago which discusses this and also shows a graph of what the loss surface looks like if you use MSE for logistic regression. Sometimes a picture gets the message across better than words. It would be worth reading the earlier replies on that thread as well.
It helped, however I did not understand how it works mathematically. I mean, when differentiating the squared error, which includes the sigmoid function in \hat{y}, how does the cost function become non-convex?
Mathematically, a function is convex if its second derivative (the Hessian, in the multivariate case) is always positive or zero, i.e., positive semidefinite.
Certainly! The squared error cost function combined with the sigmoid activation in logistic regression results in a non-convex cost function because the second derivative (Hessian) is not always positive semidefinite; it can take negative values depending on the input data and parameters.
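For a concrete check, here is a small sketch (a single training example with label y = 1 and no bias is assumed, so everything reduces to a function of the raw score z) that evaluates the analytic second derivative of each per-example cost over a range of scores:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per-example costs as a function of the raw score z, for true label y = 1:
#   squared error:  J(z) = (sigmoid(z) - 1)^2
#     d2J/dz2 = -2 * s * (1 - s)^2 * (1 - 3s),  where s = sigmoid(z)
#     -> negative whenever s < 1/3, so the cost is NOT convex in z.
#   cross-entropy:  J(z) = -log(sigmoid(z))
#     d2J/dz2 = s * (1 - s) >= 0 everywhere, so the cost IS convex in z.
z = np.linspace(-8, 8, 9)
s = sigmoid(z)

d2_mse = -2 * s * (1 - s) ** 2 * (1 - 3 * s)
d2_ce = s * (1 - s)

for zi, m, c in zip(z, d2_mse, d2_ce):
    print(f"z={zi:+5.1f}   d2(MSE)/dz2 = {m:+.5f}   d2(CE)/dz2 = {c:+.5f}")
```

The squared-error column changes sign (it is negative wherever \sigma(z) < 1/3), which is exactly the non-convexity you asked about, while the cross-entropy column is \sigma(z)(1 - \sigma(z)) and never goes below zero.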