Hi,
I just finished watching the series on the cost function. I understand the need to square the error so that negative and positive errors don't cancel out. But is it a problem that by using the squared error we bias the fit toward high values?
For example, say we have a data set x = [1, 100], y = [2, 104]. According to the squared error, a fit of y_hat = [1, 104] is much better than a fit of y_hat = [2, 100], even though in the first case the y_hat value for x = 1 has a relative error of 50%, while, when x = 100 and y_hat = 100, the error is only about 4%.
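To make the numbers concrete, here is the arithmetic I have in mind (just an illustrative snippet I wrote, nothing from the course):

```python
import numpy as np

y     = np.array([2.0, 104.0])    # true values at x = 1 and x = 100
fit_a = np.array([1.0, 104.0])    # 50% relative error at x = 1, exact at x = 100
fit_b = np.array([2.0, 100.0])    # exact at x = 1, ~4% relative error at x = 100

for name, y_hat in [("fit_a", fit_a), ("fit_b", fit_b)]:
    mse = np.mean((y - y_hat) ** 2)           # mean squared error
    mre = np.mean(np.abs(y - y_hat) / y)      # mean relative (percentage) error
    print(f"{name}: MSE = {mse:.2f}, mean relative error = {mre:.1%}")

# MSE prefers fit_a (0.50 vs 8.00), while relative error prefers fit_b (25.0% vs ~1.9%).
```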
Would it not make sense to use percentages or some other metric that avoids this issue? Is this an issue at all, or do we use this metric just because it is easy to minimize? Thank you so much for any help!
@daniel6
Hi Daniel,
As you said, the squared error is one of the methods used to measure how our model is performing and to train it, as in gradient descent. There is also the absolute error; these two are the most common ways to measure the performance of the model and to train it (which is done by minimizing that cost).
Now, why do we use the squared error rather than the absolute error? The reason comes from the gradient descent algorithm itself, as we need to compute the derivative of the cost function in order to minimize the error:
x^{i+1} = x^{i} - \alpha \frac{\partial J}{\partial x}
Finding the derivative is much easier for the squared error than for the absolute error, because the derivative of the absolute error is not defined at zero.
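Concretely, for a single error term e = \hat{y} - y:

\frac{d}{de}\, e^2 = 2e \qquad \text{(smooth everywhere)}

\frac{d}{de}\, |e| = \operatorname{sign}(e) \qquad \text{(undefined at } e = 0\text{)}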
Now, as for using the squared values, it does not matter as long as we are decreasing the error and minimizing the cost function.
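Here is a minimal sketch of that (my own illustration, not code from the course), assuming a simple linear model y_hat = w * x + b trained by gradient descent on the squared error, using the toy data from the question:

```python
import numpy as np

# toy data from the question above
x = np.array([1.0, 100.0])
y = np.array([2.0, 104.0])

w, b = 0.0, 0.0      # parameters of the (assumed) linear model y_hat = w * x + b
alpha = 3e-4         # learning rate, picked by hand for this tiny example

for step in range(100_000):
    y_hat = w * x + b
    error = y_hat - y
    # gradients of the squared-error cost J = (1 / (2m)) * sum(error^2)
    dJ_dw = np.mean(error * x)
    dJ_db = np.mean(error)
    # gradient descent update, matching the formula above
    w -= alpha * dJ_dw
    b -= alpha * dJ_db

print(w, b)   # approaches w ≈ 1.03, b ≈ 0.97, the line through both points
```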
That makes sense! Thank you very much!
@daniel6 I hope I have made it clear to you.
Hello Daniel,
I just want to add that your concern is real, so if we want to stick with the squared loss, it’d be better for us to somehow screen out those “outliers”, or collect more data to balance out the outliers.
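For example, one simple way to do the screening (just a rough sketch; the function name and threshold are my own, not something from the course) is to flag points whose residual from an initial fit is unusually large:

```python
import numpy as np

def flag_outliers(y, y_hat, z_thresh=3.0):
    """Flag points whose residual lies more than z_thresh standard deviations
    from the mean residual (one simple screening heuristic, not the only one)."""
    residuals = y - y_hat
    z_scores = (residuals - residuals.mean()) / residuals.std()
    return np.abs(z_scores) > z_thresh

# usage sketch: mask = flag_outliers(y, y_hat); keep only x[~mask], y[~mask]
```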
Raymond