Query about Squared Error Cost Function

Can anyone please tell me why we square the difference between the "prediction" and the "target"? It makes sense that we sum up the errors over every i-th training example and take the average.

But if the reason we take the average is to avoid a bigger error value when there are thousands of training examples, doesn't squaring produce a bigger number too? Why do we square in the first place? Is it so that the difference between "y-hat" and "y" is always positive?


Squaring the errors has several benefits:

  • Positive and negative errors are treated the same, so they cannot cancel each other out in the sum.
  • Very large errors are emphasized, since their contribution grows quadratically.
  • The squared-error cost function is convex, so it has a single global minimum.
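To make the first two points concrete, here is a minimal sketch of a squared-error cost (the plain mean of squared errors; some courses divide by 2m instead, which only rescales the cost). The toy arrays `y` and `y_hat` are made-up values for illustration:

```python
import numpy as np

def mse_cost(y_hat, y):
    """Mean squared error: average of the squared prediction errors."""
    errors = y_hat - y
    return np.mean(errors ** 2)

y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 6.0, 10.0])  # errors: -1, +1, +3

# Squaring makes -1 and +1 contribute equally (1 each, no cancellation),
# while the larger error (+3) contributes 9 and dominates the cost.
print(mse_cost(y_hat, y))  # (1 + 1 + 9) / 3
```

Note that without the square, the errors -1 and +1 would cancel and make the fit look better than it is; taking the absolute value would also fix that, but the square is differentiable everywhere, which is convenient for gradient descent.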