Normal equation vs gradient descent

My humble obeisances to everyone.

Prof. Andrew briefly touched on using the normal equation to find the parameters of the y=(w,x)+b regression, but did not go into the details.
As I understand, we can find w by solving normal equation

(X'*X) w = X'*y   (1)   (' stands for transpose)

But what about finding the parameter b? I tried to derive a formula for it by taking the derivative of the least-squares criterion, but it did not work: all the terms cancelled.

Intuitively, I see that b can be found as mean(Y-X*w), but I am not sure that this is the optimal solution. I will be grateful for an answer.

Besides, does equation (1) work only for centered (mean-subtracted) vectors x?

My best regards,

You can refer to this; I got help from looking here.


You may just use the same normal equation (1) to solve for the bias. Assume you have n features; then in the dataset X, you need to add one extra feature whose value is always 1.

As for the parameter vector \theta, you need to add one more element at the end.

This is how we always use the normal equation to solve for both the weights and the bias.

It works for both centered and non-centered X.
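The trick above can be sketched in numpy. The data values and variable names below are my own illustration, not from the course:

```python
import numpy as np

# Hypothetical data: one feature, five samples, roughly y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

# Append a column of ones so the last parameter plays the role of the bias b.
X = np.column_stack([x, np.ones_like(x)])

# Solve the normal equation (X'X) theta = X'y for [w, b] jointly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
w, b = theta[0], theta[1]

print(w, b)
# At the joint least-squares optimum, b indeed equals mean(y - w*x),
# which answers the original question about b.
print(np.mean(y - w * x))
```

Setting the derivative of the cost with respect to b to zero gives b = mean(y - w*x), but only once w is the jointly optimal weight, which is why solving for both at once with the extra ones-column is the standard route.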


Thank you so much, Raymond!
Looks convincing, I will try it.



One more question concerning linear regression and normal equation.
Gradient descent can, in theory, lead us to one of several local minima. However, the normal equation gives us only one solution to the linear regression problem. Does this mean that there is no problem with multiple local minima for linear regression, i.e. only one minimum exists?
Will be grateful for the answer,

Linear regression does not have multiple local minima, but rather a single global minimum due to the convex property of its cost function.
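One way to see this, in matrix notation (my notation, not from the lecture): the Hessian of the MSE cost is positive semidefinite everywhere, which is exactly the convexity condition.

```latex
J(\theta) = \frac{1}{N}\,\lVert X\theta - y\rVert^{2},
\qquad
\nabla^{2} J(\theta) = \frac{2}{N}\,X^{\top}X \succeq 0,
```

since $v^{\top}X^{\top}Xv = \lVert Xv\rVert^{2} \ge 0$ for every $v$. A convex cost has no spurious local minima: every local minimum is global.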

Thank you very much for the answer!
I had some doubts because Prof. Andrew NG mentioned the problem of multiple local minima.

Could you please tell me whether the issue of multiple local minima exists for logistic regression? It seems that the cost function is no longer convex there.


Prof. Andrew demonstrated that employing mean squared error (MSE) as the cost function for logistic regression yields a non-convex surface. However, using an appropriate cost function, such as log loss, for logistic regression results in a smooth convex surface.
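For reference, the log-loss (binary cross-entropy) cost for logistic regression, with $\sigma$ the sigmoid function, is:

```latex
J(w, b) = -\frac{1}{N}\sum_{k=1}^{N}
\Bigl[\, y_k \log \hat{y}_k + (1 - y_k)\log\bigl(1 - \hat{y}_k\bigr) \Bigr],
\qquad
\hat{y}_k = \sigma\!\bigl(w^{\top}x_k + b\bigr),
```

which is convex in $w$ and $b$, unlike the MSE applied to the sigmoid output.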

Thank you. So does it mean that once the gradient descent method converges, it will produce the global minimum of the cost function?

Yes, once gradient descent converges on a convex function, it will end up at the global minimum.

Let me please ask another question.
I am plotting the cost function in 3-D as a function of the linear regression parameters w and b.
However, the plot does not look like a convex function with a clearly visible minimum. Yes, the minimum is there, but it is not distinct and seems to be stretched out along the b axis.
Did I do something wrong, or must such a plot indeed look like this?


Surfaces of this type may occur; whether they do depends on the data and on the cost function used for the regression problem, particularly for cost functions other than MSE.

Thank you very much for the answer!
I used the standard MSE loss function
\sum_k (y_k-w*x_k-b)^2

It seems like your formula for MSE is incorrect.

Could you please have a look at how I plotted it?

# Plot surface

W, B = np.meshgrid(w_, b_)

sx = np.sum(x)
sy = np.sum(y)
sx2 = np.sum(np.square(x))
sy2 = np.sum(np.square(y))
sxy = np.sum(x*y)

Z = sx2*W**2 + 2*W*(sx*B - sxy) + N*B**2 + sy2 - 2*sy*B

fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
surf = ax.plot_surface(W, B, Z)  # , cmap=cm.coolwarm, linewidth=0, antialiased=False

What is the reason for doing this?

And why are you plotting parameters against Z values instead of plotting them against the cost?

I don’t fully understand the purpose of your code.

Thank you very much for being patient with my question.
Please advise me how to plot the MSE cost correctly.

Maybe you could kindly suggest an example.

Here are the steps you could follow, replacing the line above:

  • Iterate over all pairs of values for w_ and b_.
  • Calculate the squared error for your input x.
  • Store the squared error in a cost matrix.

Finally, you could plot the parameters and cost values.
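The steps above can be sketched like this. The data values and grid ranges are my own illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 1-D data, roughly y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

# Grids of candidate parameter values.
w_ = np.linspace(0.0, 4.0, 80)
b_ = np.linspace(-2.0, 4.0, 80)

# Fill the cost matrix: one MSE value per (w, b) pair.
Z = np.zeros((len(b_), len(w_)))
for i, b in enumerate(b_):
    for j, w in enumerate(w_):
        Z[i, j] = np.mean((y - w * x - b) ** 2)

# meshgrid returns arrays of shape (len(b_), len(w_)), matching Z.
W, B = np.meshgrid(w_, b_)
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
ax.plot_surface(W, B, Z)
ax.set_xlabel("w")
ax.set_ylabel("b")
ax.set_zlabel("MSE")
plt.show()
```

The surface is a convex paraboloid, but for uncentered x it can look like a long, shallow valley rather than a round bowl, which matches the "stretched along b" shape described earlier in the thread.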