In linear regression, do we care about the distribution of the independent variable?

Let's say most of the values cluster around a similar set of values, and there are a few values outside. The cost function averages the squared error, so even a large contribution from the outliers will not affect it. So the line will most likely predict wrong values once we try to use it on points from a cluster.


One solution to this issue is to create a more complex set of features (using non-linear combinations, or exponents), so that those clusters can be more closely modeled.
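As a rough illustration of the idea, here is a minimal sketch in plain Python that fits the same toy data twice: once with only `x` as a feature, and once with an added `x**2` feature. The data, the helper names (`solve`, `poly_features`, `fit`, `predict`, `mse`), and the choice of a quadratic are all assumptions made up for this example; in practice you would likely use a library such as NumPy or scikit-learn instead of a hand-written solver.

```python
def solve(A, b):
    """Solve the linear system A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def poly_features(x, degree):
    """Map a scalar x to the feature vector [1, x, x^2, ..., x^degree]."""
    return [x ** p for p in range(degree + 1)]


def fit(xs, ys, degree):
    """Least-squares fit via the normal equations (X^T X) w = X^T y."""
    X = [poly_features(x, degree) for x in xs]
    k = degree + 1
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)


def predict(w, x):
    return sum(wi * fi for wi, fi in zip(w, poly_features(x, len(w) - 1)))


def mse(w, xs, ys):
    """Mean squared error of the fitted weights w on the given points."""
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: a cluster near x = 1 plus two far-away points, with y = x^2.
xs = [0.8, 0.9, 1.0, 1.1, 1.2, 5.0, 6.0]
ys = [x ** 2 for x in xs]

w_line = fit(xs, ys, degree=1)  # features: [1, x]
w_quad = fit(xs, ys, degree=2)  # features: [1, x, x^2]

print("straight-line MSE:", mse(w_line, xs, ys))
print("quadratic MSE:    ", mse(w_quad, xs, ys))
```

On this made-up data the straight line cannot pass through both the cluster and the far points, while the model with the extra `x**2` feature can, so its error should come out far smaller. Real data will be noisier, and the right set of features depends on the actual relationship.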


As @TMosh mentioned, we can use a more complicated set of features to capture those outliers.

On the flip side, let's also keep an eye on overfitting. Should these outliers be treated as anomalies and ignored by the model, or are they worth the extra effort of a more complicated model that can predict them correctly as well?
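One rough way to inform that decision is to fit the line twice, with and without the suspected outliers, and compare how well each fit does on the cluster. The sketch below does this for simple one-variable regression; the data points and the helper names (`fit_line`, `cluster_mse`) are hypothetical and chosen only for illustration. Note that because the errors are squared, a single far-away point can pull the line quite strongly.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def cluster_mse(slope, intercept, xs, ys):
    """Mean squared error of the line on the given points."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: a tight cluster following y = 2x, plus one point far off that trend.
cluster_x = [1.0, 1.1, 1.2, 1.3, 1.4]
cluster_y = [2.0, 2.2, 2.4, 2.6, 2.8]
outlier_x = [10.0]
outlier_y = [5.0]

# Fit once on the cluster alone, once on the cluster plus the outlier.
slope_clean, icpt_clean = fit_line(cluster_x, cluster_y)
slope_all, icpt_all = fit_line(cluster_x + outlier_x, cluster_y + outlier_y)

# Evaluate both fits on the cluster only.
mse_clean = cluster_mse(slope_clean, icpt_clean, cluster_x, cluster_y)
mse_all = cluster_mse(slope_all, icpt_all, cluster_x, cluster_y)

print("slope without outlier:", slope_clean)
print("slope with outlier:   ", slope_all)
print("cluster MSE without outlier:", mse_clean)
print("cluster MSE with outlier:   ", mse_all)
```

If the two fits differ a lot, as they do on this toy data, the outliers are driving the model, and it becomes a judgment call whether to drop them as anomalies or to add features so the model can represent them. This comparison only flags the sensitivity; it does not say which choice is right for a given problem.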
