Why do linear regression and classification have identical gradient functions?

For linear regression we identified the cost function to be J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\big(f_{w,b}(x^{(i)}) - y^{(i)}\big)^2, i.e. \frac{1}{2m} times the sum of squares of the differences between the actual and predicted values.
The gradient was then calculated by taking the partial derivatives of this cost function with respect to w and b, which gave us the gradient equations for w and b.

However, for classification we identified the cost function to be something along the lines of: J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log f_{w,b}(x^{(i)}) - \big(1-y^{(i)}\big)\log\big(1-f_{w,b}(x^{(i)})\big)\Big]
Here f_{w,b}(x) is the sigmoid function of wx + b.
So why didn't we compute the gradient by taking the partial derivatives of the loss function above with respect to w and b?

Instead, both the linear regression and classification problems are reported to have the identical gradient function:

\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\big(f_{w,b}(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)}

I wish I had the ability to cut-and-paste the screenshots or to capture the equations; that would have made my life easier.
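Since I cannot paste the equations, here is a small NumPy sketch of what I mean (the variable names and random data are my own, not from the course): both analytic gradients reduce to the same (f - y) times x form, and a finite-difference check on each cost function agrees with that form.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(size=(m, n))
y_lin = rng.normal(size=m)           # continuous targets (linear regression)
y_log = rng.integers(0, 2, size=m)   # 0/1 labels (logistic regression)
w, b = rng.normal(size=n), 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Costs as I understand them from the lectures
def J_lin(w, b):   # (1/2m) * sum of squared errors
    return np.mean((X @ w + b - y_lin) ** 2) / 2

def J_log(w, b):   # (1/m) * sum of -y*log(f) - (1-y)*log(1-f), with f = sigmoid(wx+b)
    f = sigmoid(X @ w + b)
    return np.mean(-y_log * np.log(f) - (1 - y_log) * np.log(1 - f))

# The "identical" analytic gradient form: dJ/dw_j = (1/m) * sum_i (f_i - y_i) * x_ij
grad_lin = X.T @ ((X @ w + b) - y_lin) / m
grad_log = X.T @ (sigmoid(X @ w + b) - y_log) / m

# Finite-difference check of both gradients
eps = 1e-6
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    num_lin = (J_lin(w + e, b) - J_lin(w - e, b)) / (2 * eps)
    num_log = (J_log(w + e, b) - J_log(w - e, b)) / (2 * eps)
    print(j, np.isclose(num_lin, grad_lin[j]), np.isclose(num_log, grad_log[j]))
    # prints True, True for every j
```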


Hello @rkranjan,

For both linear regression and logistic regression, their derivatives of cost with respect to w and b happen to have the same form. Did you try to carry out the derivatives yourself? Here is a very similar discussion.

Raymond

Hello Raymond-

Thank you for your response.

Indeed I tried to do the detailed derivative myself. I must have made some mistake somewhere.

It may have been mentioned by Andrew in the course that the detailed mathematical calculation results in the same form. But I missed it.

Should we infer something more general from this? How did two loss functions that look so different end up giving the identical gradient form? What other loss functions might result in the same outcome?

Thank you once again.

Hello @rkranjan,

I will let you decide whether you want to do the research :wink: You might look for some loss functions, take their derivatives, and see what they end up looking like. Making a table to summarize them would be wonderful. Your call.

However, we can rewrite the cost gradients for linear regression and logistic regression into this

\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\underbrace{\big(f^{(i)} - y^{(i)}\big)}_{\text{error}}\;x_j^{(i)}

which clearly shows us that the gradients are proportional to the error. This makes a lot of sense: if the error is zero, the gradients are zero. This amazing property aligns with our intuition, doesn't it?

Certainly it is an interesting fact that they share the same look! However, from their respective loss functions, we can also get a glimpse of why:

Linear regression, where z is the model prediction: L = \frac{1}{2}(z - y)^2, so \frac{\partial{L}}{\partial{z}} = z - y.

Logistic regression, where p is the model prediction: L = -y\log{p} - (1-y)\log(1-p), so \frac{\partial{L}}{\partial{p}} = -\frac{y}{p} + \frac{1-y}{1-p} = \frac{p - y}{p(1-p)}.
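If anyone would like to double-check these two derivatives without redoing the algebra by hand, here is a quick SymPy sketch (the symbols are mine, not from the course):

```python
import sympy as sp

y, z, p = sp.symbols('y z p')

# Per-example squared-error loss of linear regression
L_lin = (z - y) ** 2 / 2
print(sp.simplify(sp.diff(L_lin, z)))                            # z - y

# Per-example cross-entropy loss of logistic regression
L_log = -y * sp.log(p) - (1 - y) * sp.log(1 - p)
print(sp.simplify(sp.diff(L_log, p) - (p - y) / (p * (1 - p))))  # 0, i.e. the forms match
```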

Even though both of them contain the error term, they don't actually look similar, do they? Unless we engineer a function g such that p = g(z), because in that case:

Logistic regression, where p = g(z): \frac{\partial{L}}{\partial{z}} = \frac{\partial{L}}{\partial{p}}\,\frac{\partial{p}}{\partial{z}} = (p - y)\left[\frac{1}{p(1-p)}\,\frac{\partial{p}}{\partial{z}}\right].

While we have the freedom to engineer any g, what is better than a g that ends up making the bracketed term equal to 1? Because:

  1. we get rid of the denominator
  2. this implies \frac{\partial{p}}{\partial{z}} = p(1-p), which again has the nice property that as p approaches 1 or 0, this gradient approaches 0
  3. it gives logistic regression's loss gradient the same look as linear regression's

It turns out that if we solve the equation \frac{\partial{p}}{\partial{z}} = p(1-p) by integration, we find that g is just our very familiar sigmoid function. It is only because we choose the sigmoid as our g that the loss gradient of logistic regression ends up looking very similar to that of linear regression. There is no law of nature that prohibits anyone from choosing another g, but if they do, their loss gradient will no longer look like linear regression's.
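For completeness, the integration step looks like this (separating variables, then choosing the integration constant C to be zero):

\frac{dp}{dz} = p(1-p) \;\Rightarrow\; \int\frac{dp}{p(1-p)} = \int dz \;\Rightarrow\; \ln\frac{p}{1-p} = z + C \;\Rightarrow\; p = \frac{1}{1 + e^{-(z+C)}}

and with C = 0 this is exactly the sigmoid p = \frac{1}{1+e^{-z}}.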

Lastly, I don't claim this was how the sigmoid historically came into logistic regression; I have never read that history. These are just some logical statements. :slight_smile:

If you choose to share it with us here, we can take a look!

Cheers,
Raymond

Wow Raymond. Thank you very much for this very insightful response.

You are welcome, @rkranjan!