For the loss function, could we not take the absolute value instead of the square? Taking the absolute value is simpler than squaring from a computation perspective.

Hello @Sanjaya_Nagabhushan, I think the reason **Mean Squared Error** is commonly used is its tendency to penalize large errors more than small ones (it is strongly affected by outliers), and squaring also eliminates the possibility of negative values.

Thanks @Isaak_Kamau. Could you please elaborate with an example on what exactly you mean by "penalize large errors"?

Hello @Sanjaya_Nagabhushan, I am sorry I lost this thread.

I meant that "squaring emphasizes larger differences." Imagine squaring an error of 2: you get 4 (small impact). But square an error of 10 and you get 100 (huge impact). Remember that when we train a model, the goal is to reduce the loss, which is based on the difference between the predicted value and the real value, so I think Mean Squared Error helps us spot outliers and large errors. Please refer to this thread for more: link
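To make the point concrete, here is a quick sketch (my own example, not from the course) comparing mean absolute error and mean squared error on the same hypothetical residuals. A single outlier of 10 dominates the MSE but barely moves the MAE:

```python
# Hypothetical residuals (predicted - real) for three samples,
# including one outlier of 10.
residuals = [0.1, 2.0, 10.0]

# Mean Absolute Error: every residual contributes in proportion
# to its magnitude.
mae = sum(abs(r) for r in residuals) / len(residuals)

# Mean Squared Error: the outlier's contribution is squared,
# so it dominates the average.
mse = sum(r ** 2 for r in residuals) / len(residuals)

print(f"MAE = {mae:.3f}")  # -> MAE = 4.033
print(f"MSE = {mse:.3f}")  # -> MSE = 34.670
```

Notice that under MAE the outlier accounts for 10 of the 12.1 total, while under MSE it accounts for 100 of the 104.01 total, which is the "penalize large errors more" behavior in action.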

Let me know if you have another question.

Happy Learning

Isaak Kamau

And squaring a number like 0.1 gives you 0.01, so it also *deemphasizes* small errors. This is another reason the quadratic loss function tends to perform better than the absolute value of the difference.

There are some cases in which people do use the absolute value, e.g. the L1 penalty term in Lasso, but for things like linear regression quadratic loss is a clear win.

The other way to see the implications of the behavior is to consider the derivatives. If you use:

f(z) = \displaystyle \frac {1}{2} z^2

Then f'(z) = z, of course. Whereas for

g(z) = |z|

the derivative is -1 for z < 0, +1 for z > 0, and undefined at z = 0, although it turns out in practice that the non-differentiability at 0 is not a problem.

So think about the implications of those derivatives for how the gradients will work to push the parameters in the direction of a better solution:

In the quadratic case, the “force” of the correction supplied by the gradients is exactly proportional to the magnitude of the error.

In the absolute value case, the “force” of the correction is blind to the magnitude of the error.
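The two bullet points above can be sketched in a few lines. This is my own illustration (function names are mine), assuming the gradient of |z| is conventionally taken as 0 at z = 0:

```python
def grad_quadratic(z):
    # Derivative of f(z) = (1/2) z^2 is simply z:
    # the correction scales with the size of the error.
    return z

def grad_abs(z):
    # Derivative of g(z) = |z| is sign(z): -1, 0, or +1,
    # regardless of how large the error is.
    return (z > 0) - (z < 0)

for z in [-10.0, -0.1, 0.1, 10.0]:
    print(f"z = {z:6.1f}   quadratic grad = {grad_quadratic(z):6.1f}   "
          f"abs grad = {grad_abs(z):2d}")
```

For z = 10 the quadratic gradient is 10, pushing the parameters hard; for z = 0.1 it is only 0.1, a gentle nudge. The absolute-value gradient is +1 in both cases, which is what "blind to the magnitude of the error" means here.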

@paulinpaloalto This is great. I had never thought of it from this perspective. @Sanjaya_Nagabhushan I hope you now have a better understanding of your question?