Heteroscedasticity, Multicollinearity, Diagnostics

Hello everyone,
I would like to express how confused I am about implementing regression models using ML. I have a statistics background, where we usually use OLS to find the best regression model: one that includes only the significant variables (features) and drops the insignificant ones. We can easily do that by comparing each p-value with the significance level (alpha). This is very understandable.
The problem I have is with the diagnostics part. In OLS regression we have to verify whether certain assumptions hold; if they do, we can use the regression model to make predictions. Some of these assumptions are:

  • Homoscedasticity: the variance of the residuals is constant across all levels of the independent variables.
  • No multicollinearity: Multicollinearity occurs when there is a high correlation between independent variables.
  • No endogeneity: Endogeneity refers to a situation where there is a correlation between the independent variables and the error term.
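
For readers who want to see what the first two checks look like in code, here is a minimal sketch using only NumPy. Everything in it is an assumption made for illustration: the synthetic data, the helper names (`ols_fit`, `r_squared`, `vif`, `breusch_pagan_lm`), and the choice of the studentized (Koenker) form of the Breusch-Pagan statistic. Endogeneity is not covered, since checking for it generally needs instruments or domain knowledge rather than a residual test.

```python
import numpy as np

def ols_fit(X, y):
    """Fit OLS with an intercept; return (coefficients, residuals)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, y - A @ beta

def r_squared(X, y):
    """R^2 of an OLS regression of y on X (with intercept)."""
    _, resid = ols_fit(X, y)
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R_j^2),
    where R_j^2 comes from regressing column j on the other columns."""
    return np.array([
        1.0 / (1.0 - r_squared(np.delete(X, j, axis=1), X[:, j]))
        for j in range(X.shape[1])
    ])

def breusch_pagan_lm(X, resid):
    """Studentized Breusch-Pagan statistic: n * R^2 from regressing the
    squared residuals on X. Under homoscedasticity it is approximately
    chi-square with X.shape[1] degrees of freedom."""
    return len(resid) * r_squared(X, resid**2)

# Synthetic data (hypothetical): x2 is nearly a copy of x1 (multicollinearity)
# and the noise scale grows with x3 (heteroscedasticity).
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)
x3 = rng.uniform(1, 5, size=n)
y = 1 + 2 * x1 - x2 + 0.5 * x3 + rng.normal(size=n) * x3
X = np.column_stack([x1, x2, x3])

beta, resid = ols_fit(X, y)
vifs = vif(X)
lm_stat = breusch_pagan_lm(X, resid)

print("VIF per feature:", vifs)            # a common rule of thumb flags VIF > 10
print("Breusch-Pagan LM statistic:", lm_stat)
print("5% critical value, chi-square with 3 df: 7.815")
```

Nothing here depends on how the coefficients were obtained, so the same checks can be run on the residuals of a model fitted by gradient descent. The statsmodels package also ships ready-made versions of both diagnostics (`het_breuschpagan` and `variance_inflation_factor`), which would normally be preferable to hand-rolled helpers.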

The course didn’t mention these assumptions. Is that okay? For now I’m confused about whether ML is very different from traditional statistical models, or whether the two are the same when it comes to diagnostics.

Thank you.

Hi @Yassine_H,

In my opinion, the courses cover linear regression and logistic regression in order to introduce gradient descent, which is crucial to the main dish: neural networks.

If you go through the course syllabus from the very beginning, we jump pretty quickly into gradient descent, learning rate, cost function, feature scaling, and so on. They are all preparing us for neural networks.

With neural networks as the goal (again, in my opinion), I think a formal discussion of the traditional approach would not do much to move us directly towards that goal.

Furthermore, the neural network approach is not identical to the traditional approach, and our focus here is on the former. Therefore, I think anyone who wants to learn the traditional approach first might need to look for other courses, and then come back here later for the neural network approach.

Welcome to the community, and cheers,
Raymond

Your concerns are not unusual for folks with a statistics background when they first approach ML. You’re used to applying “human learning”, but here the machine does most of that.

ML methods are quite different from statistical methods, but they often reach a similar goal.

@Yassine_H
Hi
I can understand the concerns raised.
A purist approach to mathematical and statistical modeling assumes the criteria you mentioned, and more, when optimizing the parameters, especially for temporal models.
ML, on the other hand, does take these into consideration (at a much later stage), but as mentioned by @rmwkwok, consider this from the point of view of laying a foundation for optimization and the eventual ‘learning’ of the model. Having said this, you can still apply data diagnostics from your own background and then try learning the model; it shouldn’t be a deterrent at all.
I had similar doubts when I started. I come from temporal-spatial mathematical modeling of biological signaling systems. I more or less set aside the concepts I was trained in so that I could understand the ML basics as they are.
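
To make the point about reusing diagnostics concrete, here is a small sketch, assuming synthetic data and hand-picked settings (the learning rate and iteration count are illustrative, not taken from the course). It fits a linear model by batch gradient descent on the mean squared error and compares the result with the closed-form least-squares solution. Both minimize the same cost, so they end up with essentially the same coefficients and residuals.

```python
import numpy as np

# Synthetic data (hypothetical): two roughly standardized features plus noise.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column.
A = np.column_stack([np.ones(n), X])

# Closed-form least-squares fit, as in classical OLS.
beta_ols, *_ = np.linalg.lstsq(A, y, rcond=None)

# Batch gradient descent on the cost J(w) = (1 / (2n)) * ||A w - y||^2.
w = np.zeros(A.shape[1])
learning_rate = 0.1
for _ in range(2000):
    grad = A.T @ (A @ w - y) / n   # gradient of J with respect to w
    w -= learning_rate * grad

resid_ols = y - A @ beta_ols
resid_gd = y - A @ w

print("OLS coefficients:             ", beta_ols)
print("Gradient descent coefficients:", w)
print("Largest residual difference:  ", np.max(np.abs(resid_ols - resid_gd)))
```

Because the residuals match, any residual-based check you already know from statistics can be applied to the gradient descent fit unchanged. The features here are already on a similar scale; with badly scaled features gradient descent would need a smaller learning rate or feature scaling to converge.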