Can we get w and b by just having gradient equal to 0?

Hello,

The lectures introduce the gradient descent algorithm to compute w and b at a local minimum. But
since we know that the gradient (the partial derivatives) is equal to 0 at a local minimum, why not just compute w and b by setting the gradient equal to 0?

Thank you!


Yes, actually, we can directly solve for w and b using calculus. The problem is that as the size of the input grows, finding a direct solution becomes computationally expensive, so in most cases we instead use an iterative method like gradient descent.


You can directly use the normal equation, \hat\theta = (X^TX)^{-1}X^Ty, to compute the weights, but as mentioned, performing matrix operations on very high-dimensional matrices is computationally expensive.

And the inverse of the term X^TX may not always exist.

Even if you set the derivative to 0, the resulting equation may not always be solvable analytically, and that's when iterative methods like gradient descent can help to find a close-to-optimal solution.
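
For concreteness, here is a minimal NumPy sketch (my own, not from the lectures) of that direct solve. It assumes b is folded into \hat\theta by adding a column of ones to X, and it uses the pseudo-inverse so it still runs when X^TX is singular:

```python
import numpy as np

# Toy data: 100 examples, 3 features, plus a bias column so b lives inside theta.
rng = np.random.default_rng(0)
m, n = 100, 3
X = np.c_[np.ones(m), rng.normal(size=(m, n))]
true_theta = np.array([2.0, 1.0, -3.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=m)

# Direct solve: theta = (X^T X)^{-1} X^T y.
# pinv (pseudo-inverse) is used instead of inv to cover the case where X^T X
# is singular, i.e. the inverse "may not always exist".
theta_hat = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta_hat)  # should be close to true_theta
```

The cost of inverting the n-by-n matrix X^TX grows roughly cubically with the number of features, which is why this route stops being attractive when n is very large.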


In addition, setting the derivative to 0 would result in a large number of critical points for high-dimensional features, most of which would be poor local minima.

Hey @chaohan , let me continue the discussion by giving examples of when w can be solved analytically as you described in the question, and when it can't.

We can: linear regression
In the W1 video (4:13) “Gradient descent for linear regression” we know that the gradients are

\frac{\partial{J}}{\partial{w}} = \frac{1}{m} \sum_{i=1}^{m}{(wx^{(i)}+b-y^{(i)})x^{(i)}}
\frac{\partial{J}}{\partial{b}} = \frac{1}{m} \sum_{i=1}^{m}{(wx^{(i)}+b-y^{(i)})}

And setting them to zero as you said, we can rewrite them as
\frac{1}{m} (S_{xx}w+S_xb-S_{xy}) = 0
\frac{1}{m} (S_xw + mb - S_y) = 0

or

S_{xx}w+S_xb= S_{xy}
S_xw + mb = S_y

where S_{xx} = \sum_{i=1}^{m}{x^{(i)}x^{(i)}}, S_{xy} = \sum_{i=1}^{m}{x^{(i)}y^{(i)}}, S_{x} = \sum_{i=1}^{m}{x^{(i)}}, S_{y} = \sum_{i=1}^{m}{y^{(i)}}

We can then solve for w and b with the last two equations, thanks to the fact that both w and b can be taken out of the summation sign as common factors.

w = \frac{S_{xy} - \frac{S_xS_y}{m}}{S_{xx}-\frac{S_xS_x}{m}}
b = \frac{S_y-wS_x}{m}
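
If it helps, here is a tiny NumPy check (my own, not part of the course material) that plugs some made-up data into the formulas above and compares the result with NumPy's own least-squares fit:

```python
import numpy as np

# Made-up 1D data with known slope 3 and intercept 7, plus some noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 7.0 + rng.normal(scale=0.5, size=50)
m = len(x)

# The sums defined above.
S_xx, S_xy = np.sum(x * x), np.sum(x * y)
S_x, S_y = np.sum(x), np.sum(y)

# Closed-form solution obtained by setting the gradients to zero.
w = (S_xy - S_x * S_y / m) / (S_xx - S_x * S_x / m)
b = (S_y - w * S_x) / m

print(w, b)                     # analytical w and b
print(np.polyfit(x, y, deg=1))  # NumPy's fit: [w, b], should match closely
```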

@tharunnayak14 shared the matrix version of the same solution, and embedded both w and b inside \hat\theta.

We (almost always) can’t: logistic regression
In the W3 video you can see that the gradients are

\frac{\partial{J}}{\partial{w}} = \frac{1}{m} \sum_{i=1}^{m}{( \frac{1}{1+ \exp(-wx^{(i)}-b)}-y^{(i)})x^{(i)}}
\frac{\partial{J}}{\partial{b}} = \frac{1}{m} \sum_{i=1}^{m}{(\frac{1}{1+ \exp(-wx^{(i)}-b)}-y^{(i)})}

Again, setting them to zero,

\sum_{i=1}^{m}{ \frac{x^{(i)}}{1+ \exp(-wx^{(i)}-b)}}-S_{xy} = 0
\sum_{i=1}^{m}{ \frac{1}{1+ \exp(-wx^{(i)}-b)}}-S_{y} = 0

This time, however, b and w can't be taken outside of the summation sign as common factors, which makes it very challenging to find the solution for w and b analytically. I have tried to do this myself with m=2 (only 2 data samples), and if my maths is not wrong, my solution is

w = - \frac{\ln(\frac{1}{y^{(1)}}-1) - \ln(\frac{1}{y^{(2)}}-1)}{x^{(1)} - x^{(2)}}
b = -\ln(\frac{1}{y^{(1)}}-1) - wx^{(1)}

However, as m grows, it will be very difficult. I have written down my solution for m=2, but the point is, if there is no general form for the solution of w and b for all values of m, we can't solve logistic regression the way we can solve linear regression.
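
For anyone curious, here is a quick numerical check (my own sketch) of the m=2 formulas above. Note that it only makes sense when the two targets lie strictly between 0 and 1; with hard 0/1 labels the logarithms are undefined:

```python
import numpy as np

# Two made-up data points with "soft" targets in (0, 1).
x1, x2 = 0.5, 2.0
y1, y2 = 0.2, 0.9

# The m = 2 solution derived above.
w = -(np.log(1 / y1 - 1) - np.log(1 / y2 - 1)) / (x1 - x2)
b = -np.log(1 / y1 - 1) - w * x1

sigmoid = lambda z: 1 / (1 + np.exp(-z))
# The model reproduces both targets exactly, so both gradients are zero.
print(sigmoid(w * x1 + b), sigmoid(w * x2 + b))  # 0.2, 0.9
```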

This paper finds an analytical solution for logistic regression under the condition that all predictors (meaning all the x's) are categorical variables, but I have never seen an analytical solution that works in general for continuous x's.

Lastly, when there is no analytical solution, we can go to a numerical solution, which is the family of methods gradient descent belongs to.
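
As an illustration of that numerical route, here is a bare-bones gradient-descent sketch for single-feature logistic regression; the data, learning rate, and iteration count are made up for the example:

```python
import numpy as np

# Made-up binary classification data with one feature.
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = (x + 0.3 * rng.normal(size=200) > 0).astype(float)

w, b = 0.0, 0.0
alpha = 0.1  # learning rate (arbitrary choice)
for _ in range(5000):
    p = 1 / (1 + np.exp(-(w * x + b)))  # model predictions
    dw = np.mean((p - y) * x)           # dJ/dw from the W3 gradient above
    db = np.mean(p - y)                 # dJ/db
    w -= alpha * dw
    b -= alpha * db

print(w, b)  # numerical approximation of the minimiser
```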


Thank you @rmwkwok for the detailed example. I haven't reached the W3 videos yet, but I do see that the later W1 videos say the normal equation method becomes slower with more features. Your example definitely helps me understand that idea better.
