Can anyone help me understand why the squared error cost function does not work well with gradient descent when using the sigmoid function in logistic regression, while the cross-entropy cost function does work well?
thanks
Reference: Week 2 lecture titled "Logistic Regression Cost Function"
Hi, @muneer321, welcome to the community!
While the squared error cost function works well for linear regression, it causes problems when combined with the sigmoid function in logistic regression. In logistic regression and neural networks, using squared error as the cost with a sigmoid activation makes the cost function non-convex: the loss surface can have multiple local minima and flat regions. Gradient descent then struggles to find the global optimum, since it moves very slowly across the flat regions and can get stuck in a local minimum, leading to slow convergence and suboptimal solutions. Because squared error doesn't give the optimizer a single, well-defined minimum to head toward, optimization becomes unreliable.
On the other hand, when cross-entropy is used with the sigmoid function in binary classification tasks such as logistic regression, it creates a convex loss surface. Thus, there is only one global minimum, which makes it easier for gradient descent to converge efficiently. In addition, logistic regression outputs probabilities between 0 and 1, and the cross-entropy loss directly models the distance between the predicted probability (sigmoid output) and the true label (either 0 or 1). It is a more natural fit because it penalizes misclassifications more effectively than squared error.
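To make that shape difference concrete, here is a minimal numerical sketch (the single training example with x = 1 and label y = 1, the absence of a bias term, and the grid of weight values are all assumed purely for illustration). It evaluates both costs for one example as the weight varies and checks whether the curve ever bends downward, which a convex function never does:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup: one example with x = 1 and y = 1, no bias, so y_hat = sigmoid(w).
w = np.linspace(-10, 10, 2001)
y_hat = sigmoid(w)
y = 1.0

mse = (y_hat - y) ** 2                                    # squared-error cost
ce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # cross-entropy cost

def bends_downward(cost):
    # A convex curve has a non-negative second difference everywhere.
    return bool(np.any(np.diff(cost, n=2) < -1e-10))

print("MSE bends downward somewhere (non-convex):", bends_downward(mse))  # True
print("Cross-entropy bends downward somewhere:   ", bends_downward(ce))   # False
```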
As explained in the lesson,
- When the true label is y = 1, cross-entropy encourages the prediction \hat{y} to be as close to 1 as possible by minimizing -\log(\hat{y}).
- When y = 0, it encourages \hat{y} to be as close to 0 as possible by minimizing -\log(1 - \hat{y}).
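Here is a tiny sketch of those two terms for a few prediction values (the specific \hat{y} values are just illustrative). Each term stays small when the prediction agrees with its label and blows up as the prediction moves toward the wrong end:

```python
import numpy as np

# Per-example cross-entropy terms from the lesson:
#   -log(y_hat)      is the loss when the true label is y = 1
#   -log(1 - y_hat)  is the loss when the true label is y = 0
for y_hat in [0.99, 0.9, 0.5, 0.1, 0.01]:
    loss_if_y_is_1 = -np.log(y_hat)
    loss_if_y_is_0 = -np.log(1 - y_hat)
    print(f"y_hat={y_hat:4.2f}  loss(y=1)={loss_if_y_is_1:6.3f}  loss(y=0)={loss_if_y_is_0:6.3f}")
```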
This structured gradient response gives cross-entropy an advantage: the cross-entropy loss increases rapidly as predictions deviate from the true label, providing strong gradient signals that allow for faster learning and error correction. With squared error, by contrast, when the model predicts probabilities close to 0 or 1 (i.e., the neuron is saturated), the cost function's gradient becomes very small, which makes it hard for the model to correct its mistakes and slows down learning.
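As a rough numerical sketch of that saturation effect (the label, the raw score, and the single-example setup are assumed for illustration), compare the gradient of each cost with respect to the raw score z when the model is confidently wrong:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One confidently wrong prediction: true label y = 1, but the raw score z is
# very negative, so y_hat = sigmoid(z) is close to 0 (the neuron is saturated).
y, z = 1.0, -6.0
y_hat = sigmoid(z)  # ~0.0025

# Gradient of each per-example cost with respect to z (chain rule through the sigmoid):
#   cross-entropy: dL/dz = y_hat - y                        (no sigmoid' factor)
#   squared error: dL/dz = 2 * (y_hat - y) * y_hat * (1 - y_hat)
grad_ce = y_hat - y
grad_mse = 2 * (y_hat - y) * y_hat * (1 - y_hat)

print(f"y_hat = {y_hat:.4f}")
print(f"cross-entropy gradient: {grad_ce:+.4f}     (large -> fast correction)")
print(f"squared-error gradient: {grad_mse:+.6f}   (tiny -> saturated neuron barely learns)")
```

The cross-entropy gradient stays close to -1, while the squared-error gradient carries an extra \hat{y}(1 - \hat{y}) factor that is nearly zero at saturation.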
I hope this helps!
The cost functions are different because the goals are different.
- Linear regression tries to make a model that fits the examples.
- Logistic regression tries to create a boundary between the “true” and “false” examples.
Here’s another thread from a while ago which discusses this and also shows a graph of what the loss surface looks like if you use MSE for logistic regression. Sometimes a picture gets the message across better than words. It would be worth reading the earlier replies on that thread as well.
It helped, however I did not understand how it works mathematically. I mean, when differentiating the squared error, which includes the sigmoid function in \hat{y}, how does the cost function become non-convex?
Mathematically, a function is convex if its second derivative (the Hessian, in the multivariate case) is always positive or zero, i.e., positive semidefinite.
Certainly! The squared error cost function combined with the sigmoid activation in logistic regression results in a non-convex cost function because the second derivative (Hessian) is not always positive semidefinite; it can take negative values depending on the input data and parameters.
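For a concrete check, here is a small sketch (a single training example with label y = 1 and no bias is assumed, so everything reduces to a function of the raw score z) that evaluates the analytic second derivative of each per-example cost over a range of scores:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per-example costs as a function of the raw score z, for true label y = 1:
#   squared error:  J(z) = (sigmoid(z) - 1)^2
#     d2J/dz2 = -2 * s * (1 - s)^2 * (1 - 3s),  where s = sigmoid(z)
#     -> negative whenever s < 1/3, so the cost is NOT convex in z.
#   cross-entropy:  J(z) = -log(sigmoid(z))
#     d2J/dz2 = s * (1 - s) >= 0 everywhere, so the cost IS convex in z.
z = np.linspace(-8, 8, 9)
s = sigmoid(z)

d2_mse = -2 * s * (1 - s) ** 2 * (1 - 3 * s)
d2_ce = s * (1 - s)

for zi, m, c in zip(z, d2_mse, d2_ce):
    print(f"z={zi:+5.1f}   d2(MSE)/dz2 = {m:+.5f}   d2(CE)/dz2 = {c:+.5f}")
```

The squared-error column changes sign (it is negative wherever \sigma(z) < 1/3), which is exactly the non-convexity you asked about, while the cross-entropy column is \sigma(z)(1 - \sigma(z)) and never goes below zero.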