Hey @chaohan, let me continue the discussion by giving examples of when w can be solved analytically, as you described in the question, and when it can't.
We can: linear regression
In the W1 video (4:13) “Gradient descent for linear regression” we see that the gradients are
\frac{\partial{J}}{\partial{w}} = \frac{1}{m} \sum_{i=1}^{m}{(wx^{(i)}+b-y^{(i)})x^{(i)}}
\frac{\partial{J}}{\partial{b}} = \frac{1}{m} \sum_{i=1}^{m}{(wx^{(i)}+b-y^{(i)})}
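If it helps to see these in code, here is a minimal NumPy sketch of the two gradients (the function and variable names are my own):

```python
import numpy as np

def gradients(w, b, x, y):
    """Gradients of the squared-error cost J for 1-D linear regression."""
    err = w * x + b - y           # (w x^(i) + b - y^(i)) for every sample i
    dJ_dw = np.mean(err * x)      # (1/m) * sum of err * x
    dJ_db = np.mean(err)          # (1/m) * sum of err
    return dJ_dw, dJ_db
```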
Setting these gradients to zero as you said, we can rewrite them as
\frac{1}{m} (S_{xx}w+S_xb-S_{xy}) = 0
\frac{1}{m} (S_xw + mb - S_y) = 0
or
S_{xx}w+S_xb= S_{xy}
S_xw + mb = S_y
where S_{xx} = \sum_{i=1}^{m}{x^{(i)}x^{(i)}}, S_{xy} = \sum_{i=1}^{m}{x^{(i)}y^{(i)}}, S_{x} = \sum_{i=1}^{m}{x^{(i)}}, S_{y} = \sum_{i=1}^{m}{y^{(i)}}
We can then solve for w and b from the last two equations, thanks to the fact that both w and b can be taken out of the summation sign as common factors:
w = \frac{S_{xy} - \frac{S_xS_y}{m}}{S_{xx}-\frac{S_xS_x}{m}}
b = \frac{S_y-wS_x}{m}
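As a quick sanity check, here is a sketch that computes these sums on some made-up data and compares the result against NumPy's own least-squares fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # made-up data
y = np.array([2.1, 3.9, 6.2, 7.8])

m = len(x)
S_x, S_y = x.sum(), y.sum()
S_xx, S_xy = (x * x).sum(), (x * y).sum()

w = (S_xy - S_x * S_y / m) / (S_xx - S_x * S_x / m)
b = (S_y - w * S_x) / m

print(w, b)
print(np.polyfit(x, y, 1))   # least-squares fit; should print the same w and b
```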
@tharunnayak14 shared the matrix version of the same solution, and embedded both w and b inside \hat\theta.
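For reference, a minimal sketch of that matrix version, solving the normal equation (X^T X)\hat\theta = X^T y (data and names are my own):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

X = np.column_stack([x, np.ones_like(x)])       # each row is [x^(i), 1]
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solve (X^T X) theta = X^T y
w, b = theta_hat                                # same w and b as the sum formulas
```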
We (almost always) can’t: logistic regression
In the W3 video you can see that the gradients are
\frac{\partial{J}}{\partial{w}} = \frac{1}{m} \sum_{i=1}^{m}{( \frac{1}{1+ \exp(-wx^{(i)}-b)}-y^{(i)})x^{(i)}}
\frac{\partial{J}}{\partial{b}} = \frac{1}{m} \sum_{i=1}^{m}{(\frac{1}{1+ \exp(-wx^{(i)}-b)}-y^{(i)})}
Again, setting them to zero,
\sum_{i=1}^{m}{ \frac{x^{(i)}}{1+ \exp(-wx^{(i)}-b)}}-S_{xy} = 0
\sum_{i=1}^{m}{ \frac{1}{1+ \exp(-wx^{(i)}-b)}}-S_{y} = 0
This time, however, b and w can't be taken outside of the summation sign as common factors, which makes it very challenging to find an analytical solution for w and b. I have tried to do this with m=2 (only 2 data samples) myself: both equations can be satisfied at once by requiring \frac{1}{1+\exp(-wx^{(i)}-b)} = y^{(i)} for each i, and if my maths is not wrong, the solution is
w = - \frac{\ln(\frac{1}{y^{(1)}}-1) - \ln(\frac{1}{y^{(2)}}-1)}{x^{(1)} - x^{(2)}}
b = -\ln(\frac{1}{y^{(1)}}-1) - wx^{(1)}
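You can verify this m=2 formula numerically. Note it only makes sense when each y^{(i)} is strictly between 0 and 1, because of the \ln(\frac{1}{y}-1) terms (the sample values below are made up):

```python
import numpy as np

# Made-up samples; the formula needs 0 < y < 1 because of ln(1/y - 1)
x1, x2 = 0.5, 2.0
y1, y2 = 0.2, 0.9

w = -(np.log(1 / y1 - 1) - np.log(1 / y2 - 1)) / (x1 - x2)
b = -np.log(1 / y1 - 1) - w * x1

sigmoid = lambda z: 1 / (1 + np.exp(-z))
print(sigmoid(w * x1 + b), sigmoid(w * x2 + b))  # recovers ~0.2 and ~0.9
```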
However, as m grows, this quickly becomes intractable. The point is that if there is no general form for the solution of w and b for all values of m, we can't solve logistic regression the way we can solve linear regression.
This paper finds an analytical solution for logistic regression under the condition that all predictors (meaning all the x's) are categorical variables, but I have never seen an analytical solution that works generally for continuous x's.
Lastly, when there is no analytical solution, we can turn to numerical methods, which is the family that gradient descent belongs to.
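For completeness, here is a minimal gradient-descent sketch for the 1-D logistic case (the learning rate, step count, and data are arbitrary choices of mine):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic(x, y, lr=0.1, steps=10_000):
    """Plain gradient descent on the logistic-regression cost (1-D case)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = sigmoid(w * x + b) - y   # same residual as in the gradients above
        w -= lr * np.mean(err * x)     # dJ/dw
        b -= lr * np.mean(err)         # dJ/db
    return w, b

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])   # ordinary binary labels work fine here
print(fit_logistic(x, y))
```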