How is OLS different from the normal linear regression?

I am learning a new statistical method, different from fitting y = mx + b where we use gradient descent to optimize m and b. It is OLS - ordinary least squares - which gives the weights directly as W = (X^TX)^{-1}X^Ty

Both of them have the same cost function: MSE

Hi @tbhaxor

Yes, it’s called the normal equation, and it is sometimes used instead of gradient descent, but the disadvantages of this technique are:

  • It’s useful and faster when the number of features and the number of training examples are small (one advantage is that we don’t need to choose a learning rate alpha), but when they are large, computing the inverse of the matrix becomes very slow and hard (inefficient in time)
  • It may raise an error and you won’t know where it came from. The cause may be that the matrix you want to invert is singular, meaning it has no inverse. That can happen because two of your features are linearly dependent without your knowing it, or because the number of features is greater than the number of training examples
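Both points above can be seen in a small NumPy sketch (toy data of my own, just for illustration): the normal equation needs no learning rate, but duplicating a column makes the inversion fail.

```python
import numpy as np

# Toy data: 5 examples, 2 features, plus an intercept column of ones.
X = np.c_[np.ones(5),
          np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
          np.array([2.0, 1.0, 4.0, 3.0, 5.0])]
y = np.array([3.0, 4.0, 7.0, 8.0, 11.0])

# Normal equation: W = (X^T X)^{-1} X^T y -- no learning rate needed.
W = np.linalg.inv(X.T @ X) @ (X.T @ y)

# A singular case: the third column is an exact copy of the second,
# so X^T X has no inverse and np.linalg.inv raises LinAlgError.
X_bad = np.c_[np.ones(5), np.arange(5.0), np.arange(5.0)]
try:
    np.linalg.inv(X_bad.T @ X_bad)
    singular = False
except np.linalg.LinAlgError:
    singular = True
print("singular:", singular)  # singular: True
```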



It makes sense. Also, must all the selected features be numerical for the regression? For example, if we have a categorical column, will it work with that? My guess is no (but I don’t know exactly :sweat_smile: )

It could work if you encode these features, but you may run into dependent columns

It is not clear to me; could you explain it with an example (if possible)?


I mean that if you have 2 categorical features (columns), and the first one is, for example, [cat, cat, dog, dog, cat], we represent cat with 0 and dog with 1, so it becomes [0, 0, 1, 1, 0]. The other column is, for example, [down, down, up, up, down]; if you encode it with down = 0 and up = 1, it becomes [0, 0, 1, 1, 0], which equals the first column. That makes the columns linearly dependent in the calculation of the normal equation, which leads to an error: the matrix of input features becomes singular, meaning it has no inverse, because two features depend on each other.

But Note

This case may be very rare, but it can happen
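The collision can be reproduced in a few lines of NumPy (a small sketch using the exact encodings from the example above):

```python
import numpy as np

# Label-encode the two categorical columns from the example.
animals = ["cat", "cat", "dog", "dog", "cat"]       # cat -> 0, dog -> 1
directions = ["down", "down", "up", "up", "down"]   # down -> 0, up -> 1

col1 = np.array([0 if a == "cat" else 1 for a in animals], dtype=float)
col2 = np.array([0 if d == "down" else 1 for d in directions], dtype=float)

X = np.c_[np.ones(5), col1, col2]   # intercept + two encoded columns

# The two encoded columns are identical, so X^T X loses rank
# and the normal equation cannot be applied as-is.
rank = np.linalg.matrix_rank(X.T @ X)
print(rank)   # 2, not 3: the matrix is singular
```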



A note regarding inverting large matrices: linear algebra libraries do not invert large matrices explicitly. There are efficient iterative methods to “solve” large-scale linear equations.
In other words, for y = AX, A is not computed directly as $$yX^T(XX^T)^{-1}$$ (there is no explicit matrix inversion); instead the linear system is solved iteratively and efficiently, e.g. see GMRES, Krylov subspace methods, etc.
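To illustrate the point with NumPy (a sketch on simulated data; `np.linalg.lstsq` uses an SVD-based LAPACK routine rather than forming any inverse, and the iterative solvers mentioned above live in `scipy.sparse.linalg`):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.01 * rng.normal(size=1000)

# Explicitly forming the inverse (what you usually should NOT do):
w_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# What libraries actually do: solve the least-squares problem directly,
# with no explicit matrix inversion.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_inv, w_lstsq))  # True: same solution, safer computation
```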


The approach of OLS is used when b is not in the column space of A, so we orthogonally project b onto the column space of A. In other words, b does not lie in the span of A’s columns, so we find the vector closest to b that does lie in that span. This is the technique used in stats, but in machine learning the more general gradient descent is used.
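A quick numerical check of this projection picture (a NumPy sketch with a small made-up A and b): the OLS residual is orthogonal to every column of A, which is exactly what orthogonal projection means.

```python
import numpy as np

# b is generally not in the column space of A; OLS finds the point
# A @ w in that column space closest to b (the orthogonal projection).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

w, *_ = np.linalg.lstsq(A, b, rcond=None)
projection = A @ w          # closest vector to b within span(A's columns)
residual = b - projection

# The residual is orthogonal to every column of A.
print(A.T @ residual)       # ~ [0, 0]
```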

Thanks for sharing this

Yes, we can’t say always, but chances are det(A) will be 0, thus breaking the code :smile: Is that why only real numerical columns are selected (feature selection) and then normalized?

If you use the same data for both methods, and the problem given these data is ill-posed, then I believe it is ill-posed for both methods (e.g., if your matrices lose rank). The optimization approach might not converge. Hence you apply constraints, e.g. regularization, and yes, choosing your features carefully helps (to avoid losing rank in your matrices, etc.).
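As an example of such a constraint, ridge regularization adds a small multiple of the identity to X^T X, which restores full rank even when columns are duplicated (a sketch; the regularization strength `lam` is an arbitrary choice):

```python
import numpy as np

# Duplicated column -> X^T X is singular and plain OLS is ill-posed.
X = np.c_[np.ones(5), np.arange(5.0), np.arange(5.0)]
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

lam = 0.1  # regularization strength (arbitrary for this sketch)
I = np.eye(X.shape[1])

# Ridge regression: adding lam * I makes the matrix invertible, so the
# regularized normal equation (X^T X + lam I)^{-1} X^T y is solvable.
w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
print(w_ridge.shape)  # (3,)
```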


Feature selection is the process of isolating the most consistent, non-redundant, and relevant features to use in model construction, so I think it would be a good choice when you use the normal-equation technique

Yes indeed, this holds for whichever approach you follow.

Makes sense, thanks for confirming

Linear Regression is a more general concept that involves solving the equation y = a + bX + e, where y is the dependent variable and X is an independent variable (or a number of independent variables). The goal is to estimate the parameters a and b (note that e is an error term).

Ordinary Least Squares is just one way of estimating a and b: it minimizes the MSE by choosing the a and b that minimize the sum of squared errors.

There are other methods for estimating a and b, for example Maximum Likelihood Estimation (MLE), in which you choose the values of a and b that maximize some likelihood function.
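For Gaussian errors the two estimators actually coincide: the negative log-likelihood depends on (a, b) only through the sum of squared errors. A small sketch on simulated data (with the noise scale assumed known) illustrating that the OLS fit also minimizes the Gaussian NLL:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=100)

# OLS estimate of (a, b):
A = np.c_[np.ones_like(x), x]
a_ols, b_ols = np.linalg.lstsq(A, y, rcond=None)[0]

def neg_log_likelihood(a, b, sigma=0.3):
    # Gaussian NLL: n/2 * log(2*pi*sigma^2) + SSE / (2*sigma^2),
    # so minimizing it over (a, b) is the same as minimizing the SSE.
    resid = y - (a + b * x)
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)

nll_ols = neg_log_likelihood(a_ols, b_ols)
nll_off = neg_log_likelihood(a_ols + 0.1, b_ols - 0.1)
print(nll_ols < nll_off)  # True: the OLS solution minimizes the NLL too
```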

Bottom line: Linear Regression is the equation y = a + bX + e that theorizes a relationship between y and X. OLS is just one method of solving that equation, and there are other ways of solving it that are not OLS.