How is OLS different from the normal linear regression?

tbhaxor · January 30, 2023, 11:02am

I am learning a statistical approach other than fitting y = mx + b with gradient descent to optimize m and b. It is OLS (ordinary least squares), which gives the weights in closed form as W = (X^TX)^{-1}X^Ty

Both of them have the same cost function: MSE
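For readers comparing the two approaches, here is a minimal sketch (not from the original posts) that fits the same line both ways. The synthetic data, the learning rate `alpha`, and the iteration count are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 3x + 2 plus small noise
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

# Design matrix with a column of ones so the intercept b is part of W
X = np.column_stack([np.ones_like(x), x])

# Closed form (normal equation): solve (X^T X) W = X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the same cost, MSE = mean((X w - y)^2)
w_gd = np.zeros(2)
alpha = 0.1  # learning rate (assumed value)
for _ in range(5000):
    grad = 2.0 / len(y) * X.T @ (X @ w_gd - y)
    w_gd -= alpha * grad

print("normal equation :", w_closed)
print("gradient descent:", w_gd)
```

On a well-conditioned problem like this one, both should land on essentially the same [b, m], since they minimize the same cost.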

AbdElRhaman_Fakhry · January 30, 2023, 11:43am

Hi @tbhaxor

Yes, it’s called the normal equation, and it is sometimes used instead of gradient descent. The disadvantages of this technique are:

It’s useful and faster when the number of features and the number of training examples are small (one reason is that there is no learning rate alpha to set), but otherwise inverting the matrix becomes very slow (inefficient in time).
It may raise an error whose cause is not obvious. This happens when the matrix you want to invert is singular, meaning it has no inverse. That can be because two features depend on each other without your knowing it, or because the number of features is greater than the number of training examples.

Cheers,
Abdelrahman

tbhaxor · January 30, 2023, 11:50am

That makes sense. Also, must all selected features be suitable for regression? For example, if we have a categorical column, will it work? My guess is no (but I don’t know exactly).

AbdElRhaman_Fakhry · January 30, 2023, 1:47pm

It could work if you encode these features, but you may end up with dependent columns.

tbhaxor · February 1, 2023, 11:12am

It is not clear to me. Could you explain it with an example (if possible)?

AbdElRhaman_Fakhry · February 1, 2023, 11:25am

@tbhaxor

I mean that you might have 2 categorical features (columns). Say the first one is [cat,cat,dog,dog,cat]; if we represent cat with 0 and dog with 1, it becomes [0,0,1,1,0]. Say the other column is [down,down,up,up,down]; if you encode it with down = 0 and up = 1, it also becomes [0,0,1,1,0], which is equal to the first column. That gives dependent columns in the normal equation calculation and leads to an error, because X^TX built from the input features is singular, meaning it has no inverse.

But note: this case is very rare, though it can happen.

Cheers,
Abdelrahman
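The cat/dog and down/up example above can be checked numerically. This is a small sketch assuming NumPy and an added intercept column (the intercept is an assumption, not part of the post):

```python
import numpy as np

# Encodings from the example: cat=0, dog=1 and down=0, up=1
animal = np.array([0, 0, 1, 1, 0])
direction = np.array([0, 0, 1, 1, 0])

# Design matrix: intercept column plus the two encoded features
X = np.column_stack([np.ones(5), animal, direction])
gram = X.T @ X  # the matrix the normal equation needs to invert

rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)  # rank is lower than the column count

try:
    np.linalg.inv(gram)
    print("inverse computed, but it is numerically meaningless here")
except np.linalg.LinAlgError as err:
    print("inversion failed:", err)
```

Because the two encoded columns are identical, the rank is 2 rather than 3, so X^TX has no true inverse.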

Michalis_Frangos · February 1, 2023, 12:39pm

A note regarding inverting large matrices: linear algebra libraries do not invert large matrices. There are efficient iterative methods to “solve” large-scale linear equations.
In other words, for y = AX, A is not computed directly as $$yX^T(XX^T)^{-1}$$ (so there is no matrix inversion). Instead, the linear equation is solved iteratively and efficiently, e.g. with GMRES and other Krylov methods.
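To illustrate "solve, don't invert", here is a sketch using the thread's Xw = y convention. It compares NumPy's `np.linalg.lstsq` with a hand-written conjugate gradient loop on the normal equations. Conjugate gradient is a Krylov method, but it is swapped in here for brevity and is not GMRES; the helper name `cg_normal_equations` and the random data are hypothetical.

```python
import numpy as np

def cg_normal_equations(X, y, max_iters=50, rel_tol=1e-10):
    """Conjugate gradient on (X^T X) w = X^T y using only matrix-vector products."""
    w = np.zeros(X.shape[1])
    r = X.T @ (y - X @ w)      # residual of the normal equations
    p = r.copy()
    rs = r @ r
    rs0 = rs
    for _ in range(max_iters):
        if rs <= (rel_tol ** 2) * rs0:
            break
        Xp = X @ p
        step = rs / (Xp @ Xp)  # p^T (X^T X) p computed as ||X p||^2
        w += step * p
        r -= step * (X.T @ Xp)
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + rng.normal(scale=0.01, size=500)

w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # direct least-squares solver
w_cg = cg_normal_equations(X, y)                 # iterative, never forms an inverse

print(w_lstsq)
print(w_cg)
```

Neither path forms (X^TX)^{-1} explicitly. For genuinely large or sparse problems, a library routine would be the usual choice over a hand-written loop.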

tbhaxor · February 1, 2023, 12:47pm

The OLS approach is used when b is not in the column space of A, so we orthogonally project b onto the column space of A. In other words, b does not lie in the span of the columns of A, so we find the vector in that span that is closest to b. This technique is used in stats, whereas machine learning typically uses gradient descent.

Thanks for sharing this
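The projection view above can be sketched on a tiny system. The matrix `A` and vector `b` below are assumed values chosen so that b is not in the column space of A:

```python
import numpy as np

# 3 equations, 2 unknowns: no exact solution to A x = b
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
p = A @ x_hat        # projection of b onto the column space of A
residual = b - p     # the part of b that A cannot reach

print(x_hat)             # approximately [ 5. -3.]
print(A.T @ residual)    # approximately [0. 0.]
```

The residual is orthogonal to every column of A, which is exactly the condition A^T(b - A x) = 0 that rearranges into the normal equation.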

tbhaxor · February 1, 2023, 12:49pm

Yes, we cannot say always, but chances are det(A) will be 0 and break the code. Is that why only real-valued numerical columns are selected (feature selection) and then normalized?

Michalis_Frangos · February 1, 2023, 12:54pm

If you use the same data for both methods and the problem is ill-posed given these data, then I believe it is ill-posed for both methods (e.g., if your matrices lose rank). The optimization approach might not converge. Hence you apply constraints, e.g. regularization, and yes, choosing your features carefully helps (to avoid losing rank in your matrices, etc.).
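As a sketch of how regularization restores rank, the example below adds a ridge (L2) penalty to a deliberately rank-deficient problem. The data and the penalty strength `lam` are assumptions for illustration only.

```python
import numpy as np

x = np.array([0.0, 0.0, 1.0, 1.0, 0.0])
X = np.column_stack([x, x])   # two identical columns, so X^T X is singular
y = np.array([0.0, 0.0, 2.0, 2.0, 0.0])

lam = 1e-3                    # regularization strength (assumed value)
gram = X.T @ X
ridge_gram = gram + lam * np.eye(2)

rank_plain = np.linalg.matrix_rank(gram)
rank_ridge = np.linalg.matrix_rank(ridge_gram)
print(rank_plain, rank_ridge)

# With the penalty the system has a unique solution
w = np.linalg.solve(ridge_gram, X.T @ y)
print(w)
```

The penalty splits the weight evenly between the duplicated columns, which is one reasonable way to pick among the infinitely many unregularized solutions.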

AbdElRhaman_Fakhry · February 1, 2023, 1:24pm

@tbhaxor
Feature selection is the process of isolating the most consistent, non-redundant, and relevant features to use in model construction, so I think it is a good choice when you use the normal equation technique.

Michalis_Frangos · February 1, 2023, 1:40pm

Yes indeed, this holds whichever approach you follow.

tbhaxor · February 1, 2023, 1:50pm

Makes sense, thanks for confirming

Isaac_Awotwe · February 1, 2023, 11:56pm

Linear Regression is a more general concept that involves solving the equation y = a + bX + e, where y is the dependent variable and X is one or more independent variables. The goal is to estimate the parameters a and b (note that e is an error term).

Ordinary Least Squares is just one way of estimating a and b: it chooses the a and b that minimize the sum of squared errors (equivalently, the MSE).

There are other methods for estimating a and b, for example Maximum Likelihood Estimation (MLE), in which you choose the values of a and b that maximize some likelihood function.
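As a side note on how MLE relates to OLS: if we assume (this is an added assumption, not stated above) that the errors e_i are independent and Gaussian with zero mean and variance \sigma^2, the log-likelihood for n observations is

$$\log L(a, b, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - a - bX_i\right)^2$$

Only the last term depends on a and b, so maximizing the likelihood over a and b is the same as minimizing the sum of squared errors. Under that assumption MLE and OLS give the same estimates; with a different error distribution they can differ.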

Bottom line: Linear Regression is the equation y = a + bX + e that theorizes a relationship between y and X. OLS is just one method of solving that equation, and there are other ways of solving it that are not OLS.
