Logistic Regression Derivative of J(w,b)

Why is it that in the Gradient Descent algorithm for logistic regression, when we compute the derivative of J(w,b), we use the derivative of the “Squared Error Cost Function” that we use in the Linear Regression model?

In the Logistic Regression model we have a different Cost Function, as shown in the attached image. Why don’t we use the derivative of that cost function to update the parameters w and b when running gradient descent?

It’s confusing that in the Logistic Regression model the Cost Function is different from the cost function whose derivative is taken while running Gradient Descent.

Can anyone explain the intuition behind this?

But here, f_{w,b}(x^{(i)}) is different from the one used in the “Squared Error Cost Function”.
In linear regression, f_{w,b}(x^{(i)}) = w \cdot x^{(i)} + b, but in logistic regression, f_{w,b}(x^{(i)}) = \displaystyle \frac{1}{1+e^{-(w \cdot x^{(i)} + b)}}.
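To make the difference in f concrete, here is a minimal NumPy sketch (my own illustration with hypothetical function names, not code from the course notebooks):

```python
import numpy as np

def f_linear(w, b, x):
    # Linear regression model: f_{w,b}(x) = w . x + b
    return np.dot(w, x) + b

def f_logistic(w, b, x):
    # Logistic regression model: f_{w,b}(x) = sigmoid(w . x + b)
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

# Example (hypothetical values):
w, b, x = np.array([0.5, -1.0]), 0.2, np.array([1.0, 2.0])
print(f_linear(w, b, x))    # -1.3
print(f_logistic(w, b, x))  # sigmoid(-1.3) ≈ 0.214
```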


In addition to Saif’s answer, note that if you take the derivative of the Logistic Regression cost function with respect to its parameters w and b, you obtain derivatives that look the same as the derivatives of the Squared Error cost function; the only difference is the function f_{w,b}(x^{(i)}) used in each case, as Saif previously mentioned.
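To make that concrete, here is a sketch of the standard derivation (the usual textbook result, not an excerpt from the lecture). Starting from the logistic cost

J(w,b) = -\displaystyle \frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log f_{w,b}(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-f_{w,b}(x^{(i)})\right) \right]

with f_{w,b}(x) = \sigma(w \cdot x + b) and \sigma'(z) = \sigma(z)\left(1-\sigma(z)\right), the chain rule gives

\displaystyle \frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)

The gradients only look like the squared-error ones because \sigma'(z) cancels against the derivative of the log terms, leaving the factor \left(f_{w,b}(x^{(i)}) - y^{(i)}\right).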

You are right, but in the earlier video where Andrew explains the Cost Function of Logistic Regression, he mentions that if we use the Squared Error Cost Function in logistic regression, the resulting graph of J(w,b) vs w,b is non-convex, so gradient descent can get stuck in a local minimum and may not reach the global minimum of J(w,b), as shown in the attached image.

If that’s the case, why are we using Squared Error Cost Function while running Gradient Descent?

But we are not using the “Squared Error Cost Function” in logistic regression. If f is different, it means the whole formula is different.

That figure on the right (which says “non-convex”) is exactly why we don’t use the squared error cost function for logistic regression.

When we use the correct cost function for logistic regression, it is convex.
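For completeness, a standard fact (not spelled out in this thread): the logistic cost is convex because its Hessian with respect to w is positive semi-definite,

\displaystyle \nabla^2_{w} J(w,b) = \frac{1}{m}\sum_{i=1}^{m} f_{w,b}(x^{(i)})\left(1-f_{w,b}(x^{(i)})\right) x^{(i)} \left(x^{(i)}\right)^{T} \succeq 0,

so gradient descent cannot get trapped in a spurious local minimum.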

Exactly, that’s my point.

And I’m trying to understand the reason behind using the Squared Error Cost Function while running gradient descent for the Logistic Regression model (as shown in my main question).

Let’s see it again:

In this image, under the red-marked area we can see the Cost Function that we use for the Logistic Regression model, since the Squared Error Cost Function ends up being non-convex.

This means that we’ll run gradient descent on this Cost Function (under the red-marked area) to avoid getting stuck in a local minimum.

However, in the image we can see that we run gradient descent using the Squared Error Cost Function (under the blue-marked area). This means that our graph of J(w,b) will be non-convex and we might end up in a local minimum.

Why are we doing this? Instead of using the derivative of Logistic Regression’s Cost Function, why are we using the Squared Error Cost Function’s?

The intuition behind the function remains the same even if f is different.

Since f(x^{(i)}) = \hat{y}^{(i)},

J(w,b) = \displaystyle \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2

Which means that this is a squared error cost function, regardless of the value of f.

That’s incorrect. We don’t use the squared error cost function for logistic regression.

That section of the lecture is to explain WHY we don’t use the squared error for logistic regression.

Hi @Ammar_Jawed,

It seems to me your logic is that because the equations in the blue box look the same as the derivatives of the squared error cost for linear regression, they cannot be the derivatives of the logistic cost for logistic regression.

That is not correct: the two sets of derivatives do take the same form, and you would have seen it if you had worked out the derivatives step by step yourself. Check out this post for the steps.
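As a sanity check, here is a minimal numerical sketch (my own illustration, not from the course notebooks) comparing the analytic gradient \frac{1}{m}\sum \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x_j^{(i)} of the log loss against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, b, X, y):
    # Logistic regression cost: average log loss over m examples
    f = sigmoid(X @ w + b)
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

def analytic_grad_w(w, b, X, y):
    # Claimed derivative of the log loss w.r.t. w: (1/m) * sum((f - y) * x)
    f = sigmoid(X @ w + b)
    return X.T @ (f - y) / len(y)

# Toy data (hypothetical values, just for the check)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
w, b = rng.normal(size=3), 0.1

# Finite-difference estimate of dJ/dw_0
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0] += eps
w_minus[0] -= eps
numeric = (log_loss(w_plus, b, X, y) - log_loss(w_minus, b, X, y)) / (2 * eps)

print("analytic dJ/dw_0:", analytic_grad_w(w, b, X, y)[0])
print("numeric  dJ/dw_0:", numeric)  # should agree to several decimal places
```

The two values agree to several decimal places, which confirms that the blue-box equations really are the derivatives of the logistic cost, not of the squared error.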

Raymond

That’s interesting, thanks for sharing. It makes more sense to me now how gradient descent works for the logistic regression model.

Glad to hear that, @Ammar_Jawed !

Cheers,
Raymond