Week 3: Gradient Descent Implementation

For logistic regression, the cost function is different from the squared-error cost function used for linear regression. So why, when computing gradient descent for the logistic regression cost function, does the update look identical to the one derived for the linear regression cost function? I understand that f(x) is different for linear regression (a linear function) and for logistic regression (the sigmoid function).

Gradient descent for logistic regression:

Please clarify.

The reason the update rule looks identical is due to how the Chain Rule interacts with the Sigmoid function.

When we calculate the derivative of the Logistic Cost function d/dwj J(w, b), we break it down into three parts:

  1. How the Cost changes with respect to the Prediction (f).

  2. How the Prediction (f) changes with respect to the Linear sum (z = w.x + b).

  3. How the Linear sum (z) changes with respect to the Weight (wj).

    The Step-by-Step Derivation:

    If we look at a single training example, the derivatives are:

    1. Cost w.r.t Prediction: dJ / df = (f - y) / ( f (1 - f) )
    2. Prediction w.r.t z (Sigmoid Derivative): df / dz = f (1 - f)
    3. z w.r.t Weight: dz / dwj = xj

    When you multiply these together using the Chain Rule:

    dJ / dwj = dJ / df . df / dz . dz / dwj

    dJ / dwj = ( ( f - y ) / ( f ( 1 - f) ) ) . f ( 1 - f ) . xj

    The f (1 - f) terms cancel out perfectly, leaving you with the familiar term:

    dJ / dwj = (f - y) xj
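The cancellation above can be checked numerically: a minimal sketch (with made-up example values for x, y, w, and b) comparing the simplified analytic gradient (f - y) xj against a finite-difference derivative of the log loss.

```python
import math

# Hypothetical single training example; values chosen purely for illustration.
x, y = [1.5, -0.7], 1.0
w, b = [0.3, -0.2], 0.1

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # f = sigmoid(w . x + b)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def log_loss(w, b, x, y):
    f = predict(w, b, x)
    return -y * math.log(f) - (1 - y) * math.log(1 - f)

# Analytic gradient from the chain-rule result: dJ/dwj = (f - y) * xj
f = predict(w, b, x)
analytic = [(f - y) * xj for xj in x]

# Numerical gradient via central finite differences, one weight at a time.
eps = 1e-6
numeric = []
for j in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    numeric.append((log_loss(w_plus, b, x, y) - log_loss(w_minus, b, x, y)) / (2 * eps))

# The two gradients agree to numerical precision.
for a, n in zip(analytic, numeric):
    assert abs(a - n) < 1e-6
```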

    Linear Regression Prediction formula - w . x + b (Any real number)
    Logistic Regression Prediction formula - 1 / ( 1 + e ^ -( w . x + b ))
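The two prediction formulas above differ only in the final sigmoid squashing, which a short sketch makes explicit (function names are my own, not from the course):

```python
import math

def linear_predict(w, b, x):
    # w . x + b : can be any real number
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def logistic_predict(w, b, x):
    # 1 / (1 + e^-(w . x + b)) : always strictly between 0 and 1
    return 1.0 / (1.0 + math.exp(-linear_predict(w, b, x)))
```

Note that when z = w . x + b is 0, the logistic prediction is exactly 0.5, which is why 0.5 is the natural decision threshold.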

    If you tried to use the Linear Regression “Squared Error” cost function with the sigmoid prediction, the resulting cost surface would be non-convex. This would create a “bumpy” surface with many local minima, making it nearly impossible for Gradient Descent to find the global best solution.
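The non-convexity claim can be checked numerically along a single weight. A sketch, assuming a single example with x = 1, y = 0: second finite differences of squared error composed with the sigmoid change sign (curvature flips, so non-convex), while those of the log loss never go negative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Cost as a function of a single weight w, for one example with x = 1, y = 0.
def squared_error(w):
    return (sigmoid(w) - 0.0) ** 2

def log_loss(w):
    return -math.log(1.0 - sigmoid(w))

def second_differences(cost, lo=-6.0, hi=6.0, n=100):
    """Discrete curvature cost(w-h) - 2*cost(w) + cost(w+h) on a grid."""
    h = (hi - lo) / n
    ws = [lo + i * h for i in range(n + 1)]
    return [cost(ws[i - 1]) - 2 * cost(ws[i]) + cost(ws[i + 1])
            for i in range(1, n)]

sq = second_differences(squared_error)
ll = second_differences(log_loss)

# Squared error's curvature changes sign along w -> non-convex.
assert any(d > 0 for d in sq) and any(d < 0 for d in sq)
# Log loss curvature is never negative along w -> convex.
assert all(d >= -1e-12 for d in ll)
```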

    By using the Log Loss function shown in your image, the gradient simplifies beautifully and the cost surface remains a convex “bowl”, so Gradient Descent can find the optimal weights.
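Putting the simplified gradient into a full update loop, here is a minimal batch gradient descent sketch for logistic regression (plain Python, my own illustrative implementation rather than the course's code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for logistic regression (illustrative sketch)."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(iters):
        # Predictions f = sigmoid(w . x + b) for every example.
        f = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]
        # The simplified gradient: (f - y) * x, averaged over the examples.
        err = [fi - yi for fi, yi in zip(f, y)]
        grad_w = [sum(err[i] * X[i][j] for i in range(m)) / m for j in range(n)]
        grad_b = sum(err) / m
        # Simultaneous update of all parameters.
        w = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
        b -= alpha * grad_b
    return w, b

# Tiny separable toy data: points below 1.5 labeled 0, above labeled 1.
w, b = gradient_descent([[0.0], [1.0], [2.0], [3.0]], [0.0, 0.0, 1.0, 1.0])
```

Notice the loop never mentions the log loss explicitly: thanks to the cancellation derived above, only the residual (f - y) and the inputs are needed.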


Here’s another historical thread that shows the calculation of the derivatives in the Logistic Regression/sigmoid/cross entropy loss case.


Thank you @sanjaypsachdev for the clarification and for reminding me of the chain rule for derivatives.

Thank you @paulinpaloalto for further reference.