Normal equation vs gradient descent

My humble obeisances to everyone.

Prof. Andrew briefly touched on using the normal equation to find the parameters of the regression y = (w, x) + b, but did not go into the details.
As I understand it, we can find w by solving the normal equation

(X'X) w = X' y   (1)   (where ' denotes transpose)

But what about finding the parameter b? I tried to derive a formula for it by taking the derivative of the least-squares criterion, but it did not work: all the terms cancelled.

Intuitively, I see that b can be found as mean(Y - X*w), but I am not sure this is the optimal solution. I will be grateful for an answer.

Also, does equation (1) work only for centered (mean-subtracted) vectors x?

My best regards,
Vasyl

You can refer to this; I got help from looking here.

Hello @vasyl.delta,

You may just use the same normal equation (1) to solve for the bias as well. Assume you have n features; then, in the dataset X, you need to add one extra feature whose value is always 1.

As for the parameter vector \theta, you need to add one more element at the end.

This is how we always use the normal equation to solve for both the weights and the bias.

It works for both centered and non-centered X.
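
For illustration, here is a minimal NumPy sketch of this idea (the data below is made up just for the example; in practice you would use your own X and y):

import numpy as np

# toy data: X has shape (m, n), y has shape (m,)
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([3.0, 2.5, 4.5, 6.0])

# append a constant feature of 1s so the last parameter acts as the bias b
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# solve the normal equation (X'X) theta = X'y for theta = [w, b]
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w, b = theta[:-1], theta[-1]
print("w =", w, " b =", b)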

Cheers,
Raymond

Thank you so much, Raymond!
Looks convincing, I will try it.

@vasyl.delta

Cool!

Raymond

One more question concerning linear regression and normal equation.
Gradient descent can theoretically lead us to one of several local minima. However, the normal equation gives us only one solution to the linear regression problem. Does this mean that there is no problem of multiple local minima for linear regression, i.e. that only one minimum exists?
I will be grateful for an answer,
Vasyl.

Linear regression does not have multiple local minima, but rather a single global minimum due to the convex property of its cost function.
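
One way to see why: writing the cost as J(\theta) = \frac{1}{2m}\|X\theta - y\|^2 (with the bias folded into \theta via the extra column of ones), its Hessian is

\nabla^2 J(\theta) = \frac{1}{m} X^T X,

which is positive semidefinite for any X, so J is convex and any local minimum is also the global minimum.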

Thank you very much for the answer!
I had some doubts because Prof. Andrew Ng mentioned the problem of multiple local minima.

Could you please tell me whether the issue of multiple local minima exists for logistic regression? It seems that the cost function is no longer convex there.

No.

Prof. Andrew demonstrated that employing mean squared error (MSE) as the cost function for logistic regression yields a non-convex surface. However, using an appropriate cost function, such as log loss, for logistic regression results in a smooth convex surface.
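
For reference, the log loss cost used in the course is

J(w, b) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y^{(i)}\log f_{w,b}(x^{(i)}) + \big(1 - y^{(i)}\big)\log\big(1 - f_{w,b}(x^{(i)})\big)\Big], \qquad f_{w,b}(x) = \frac{1}{1 + e^{-(w\cdot x + b)}},

and this function is convex in w and b, whereas plugging the sigmoid into MSE is not.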

Thank you. So this means that once the gradient descent method converges, it will produce the global minimum of the cost function?

Yes, once gradient descent converges on a convex function, it will end up at the global minimum.

Let me please ask another question.
I am plotting the cost function in 3-D as it depends on the linear regression parameters w and b.
However, the plot does not look like a convex function with a clearly visible minimum. Yes, the minimum is there, but it is not distinct and seems to be elongated along the b axis.
Did I do something wrong, or must such a plot indeed look like this?

[Figure: Figure_loss_func — 3-D plot of the cost surface over (w, b)]

Surfaces of this type may exist; whether they appear depends on the type of cost function used, for example a cost other than MSE.

Thank you very much for the answer!
I used the standard MSE loss function
\sum_k (y_k - w x_k - b)^2

It seems like your formula for MSE is incorrect.
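
For reference, the MSE cost as defined in the course includes an averaging factor,

J(w, b) = \frac{1}{2m}\sum_{k=1}^{m}\big(w x_k + b - y_k\big)^2,

so a plain sum of squared errors is not quite the MSE, although it has the same minimizer.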

Could you please have a look at how I plotted it?

# Plot the MSE cost surface over a grid of (w, b) values
# (x and y are the training data arrays defined earlier)
import numpy as np
import matplotlib.pyplot as plt

w_ = np.linspace(-0.1, 0.1, 100)
b_ = np.linspace(40, 60, 500)

W, B = np.meshgrid(w_, b_)

# precomputed sums used in the expanded form of sum_k (y_k - w*x_k - b)^2
N = len(x)
sx = np.sum(x)
sy = np.sum(y)
sx2 = np.sum(np.square(x))
sy2 = np.sum(np.square(y))
sxy = np.sum(x * y)

# expanded sum of squared errors as a function of (w, b)
Z = sx2 * W**2 + 2 * W * (sx * B - sxy) + N * B**2 + sy2 - 2 * sy * B

fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
surf = ax.plot_surface(W, B, Z)  # optionally: cmap=cm.coolwarm, linewidth=0, antialiased=False
ax.set_xlabel('w')
ax.set_ylabel('b')
plt.show()

What is the reason for doing this?

And why are you plotting parameters against Z values instead of plotting them against the cost?

I don’t fully understand the purpose of your code.

Thank you very much for being patient with my question.
Please advise me on how to plot the MSE cost correctly.

Maybe you could kindly suggest an example.

Here are the steps you could follow to replace the line above:

  • Iterate over all pairs of values for w_ and b_:
    • Calculate the squared error for your input x.
    • Store the squared error in the cost matrix.

Finally, you could plot the parameters against the cost values, as in the sketch below.
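
A rough sketch of those steps (assuming x and y are the 1-D training arrays from your earlier code):

import numpy as np
import matplotlib.pyplot as plt

w_ = np.linspace(-0.1, 0.1, 100)
b_ = np.linspace(40, 60, 500)
W, B = np.meshgrid(w_, b_)

# compute the MSE cost for every (w, b) pair on the grid
cost = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        err = y - (W[i, j] * x + B[i, j])   # residuals for this (w, b)
        cost[i, j] = np.mean(err ** 2)      # mean squared error

fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
ax.plot_surface(W, B, cost)
ax.set_xlabel('w')
ax.set_ylabel('b')
ax.set_zlabel('MSE cost')
plt.show()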