Can we get w and b by just having gradient equal to 0?


The lectures introduce the gradient descent function to compute w and b at a local minimum. But
since we know that the gradient (partial derivative) is equal to 0 at the local minimum, why not just compute w and b by having the gradient equal to 0?

Thank you!

1 Like

yes actually, we can directly solve for w and b using calculus but the problem is as the size of input increases finding a direct solution becomes computationally expensive, so instead we use an iterative method like gradient descent for most cases.

you can directly use the above formula to compute the weights, but as mentioned performing matrix multiplications of very high dimensional matrices is computationally expensive.

And the inverse of the term (Xtranspose.X) may not always exist.

Even if you set the derivative to 0, the equation you get might not be possible to solve analytically all the time, and that’s when iterative methods like gradient descent can help to find a close best solution.

1 Like

In addition putting the derivative term to 0 would result in large no of critical points for high dimensional feature, most of which would be poor local minima

Hey @chaohan , let me continue the discussion by giving examples of when w can be solved analytically as you described in the question, and when can’t.

We can: linear regression
In W1 Video (4:13) “Gradient descent for linear regression” we know the the gradients are

\frac{\partial{J}}{\partial{w}} = \frac{1}{m} \sum_{i=1}^{m}{(wx^{(i)}+b-y^{(i)})x^{(i)}}
\frac{\partial{J}}{\partial{b}} = \frac{1}{m} \sum_{i=1}^{m}{(wx^{(i)}+b-y^{(i)})}

And putting them zero as you said, we can rewrite them as
\frac{1}{m} (S_{xx}w+S_xb-S_{xy}) = 0
\frac{1}{m} (S_xw + mb - S_y) = 0


S_{xx}w+S_xb= S_{xy}
S_xw + mb = S_y

where S_{xx} = \sum_{i=1}^{m}{x^{(i)}x^{(i)}}, S_{xy} = \sum_{i=1}^{m}{x^{(i)}y^{(i)}}, S_{x} = \sum_{i=1}^{m}{x^{(i)}}, S_{y} = \sum_{i=1}^{m}{y^{(i)}}

We can then solve w and b with the last 2 equations thanks to the fact that both w and b can be taken out of the summation sign as common factors.

w = \frac{S_{xy} - \frac{S_xS_y}{m}}{S_{xx}-\frac{S_xS_x}{m}}
b = \frac{S_y-wS_x}{m}

@tharunnayak14 shared the matrix version of the same solution, and embedded both w and b inside \hat\theta.

We (almost always) can’t: logistic regression
In W3 video you can see that the gradients are

\frac{\partial{J}}{\partial{w}} = \frac{1}{m} \sum_{i=1}^{m}{( \frac{1}{1+ \exp(-wx^{(i)}-b)}-y^{(i)})x^{(i)}}
\frac{\partial{J}}{\partial{b}} = \frac{1}{m} \sum_{i=1}^{m}{(\frac{1}{1+ \exp(-wx^{(i)}-b)}-y^{(i)})}

Again, putting them zeros,

\sum_{i=1}^{m}{ \frac{x^{(i)}}{1+ \exp(-wx^{(i)}-b)}}-S_{xy} = 0
\sum_{i=1}^{m}{ \frac{1}{1+ \exp(-wx^{(i)}-b)}}-S_{y} = 0

This time, however, b and w can’t be taken outside of the summation sign as common factors, which makes it very challenging to analytically find the solution for w and b. I have tried to do this with m=2 (only 2 data samples) myself and if my maths is not wrong, my solution is

w = - \frac{\ln(\frac{1}{y^{(1)}}-1) - \ln(\frac{1}{y^{(2)}}-1)}{x^{(1)} - x^{(2)}}
b = -\ln(\frac{1}{y^{(1)}}-1) - wx^{(1)}

However, as m grows, it will be very difficult. I have written down my solution for m=2, but the point is, if there is no general form for the solution of w and b for all values of m, we can’t solve logistic regression like the way we can for linear regression.

This paper finds a analytical solution for logistic regression with the condition that all predictors (meaning all the x's) are categorical variables, but I have never seen an analytical solution that is generally for continuous x's.

Lastly, when there is no analytical solution, we can go to numerical solution which is what gradient descent belongs to.


Thank you @rmwkwok for the detailed example. I haven’t reached the W3 videos yet, but I do see in later W1 videos it says the normal equation method becomes slower with more features. Your example definitely helps me understand that idea better.

1 Like