What is the logic behind squaring the error distance rather than taking its absolute value, given that both get rid of the negative sign? Since we are already averaging the squared errors to avoid large values (as the prof said), we could just as well have taken absolute values. Please help. Thank you.

We use the square of the error distance because:

- We need the cost function to have a continuous partial derivative, because that’s how we find the gradients that we need for gradient descent. The absolute value does not have a continuous partial derivative: its derivative jumps from -1 to +1 at zero and is undefined there.
- The squared error cost function has a very simple partial derivative which is easily computed.
- The squared error cost function emphasizes the correction of large magnitude errors.
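
To make the three points above concrete, here is a minimal sketch in Python that compares the two per-example losses and their derivatives with respect to the error. The function names are illustrative and not from the course, and returning 0 for the absolute value's derivative at zero is just one common convention, since the derivative is undefined there.

```python
def squared_loss(error):
    return error ** 2

def squared_grad(error):
    # d/de (e^2) = 2e: continuous, and it shrinks smoothly as the error shrinks
    return 2 * error

def absolute_loss(error):
    return abs(error)

def absolute_grad(error):
    # d/de |e| is +1 for e > 0 and -1 for e < 0; it is undefined at e = 0.
    # Returning 0.0 there is an assumed convention, not a true derivative.
    if error > 0:
        return 1.0
    if error < 0:
        return -1.0
    return 0.0

for e in [-2.0, -0.1, 0.0, 0.1, 2.0]:
    print(e, squared_loss(e), squared_grad(e), absolute_loss(e), absolute_grad(e))
```

Note how the squared-error derivative passes smoothly through zero, while the absolute-value derivative flips sign abruptly between -0.1 and 0.1. Also, an error 10 times larger costs 100 times more under the squared loss but only 10 times more under the absolute loss, which is what "emphasizes large magnitude errors" means in practice.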

Thank you.

So this is what I understood: since the absolute value function is not differentiable at x = 0, we use the squared error (a parabola) instead. Is this correct?

That is only one of the reasons.

Hello!

In my opinion, it would be going too far to judge squaring against taking the absolute value here, given that @Nikhilesh didn’t mention what problem is at hand.

For the purpose of the course, given the great properties that Tom mentioned, I think it is sufficient for us to use the squared error as a starting point. We also shouldn’t forget that in a linear regression problem the squared error cost is convex, so there is only one minimum cost and no other local minima for gradient descent to get stuck in. This simplifies the discussion and lets us focus on what the course is meant to deliver.

@Nikhilesh, your reply covers only Tom’s first point. Also, we can’t say that one is absolutely better than the other without any details about the problem you are facing. For example, as Tom said in his third point, the squared error emphasizes large-magnitude errors; if you don’t want that, you may want to try another loss function.
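
As a rough illustration of that trade-off, the sketch below fits a single constant to some made-up numbers containing one outlier. It relies on a standard fact: the constant that minimizes the total squared error is the mean, and the constant that minimizes the total absolute error is the median. The data and the helper names are invented for this example, and the brute-force grid search is only there to confirm the two minimizers.

```python
from statistics import mean, median

def total_squared_error(values, c):
    return sum((v - c) ** 2 for v in values)

def total_absolute_error(values, c):
    return sum(abs(v - c) for v in values)

# Made-up data for illustration; the last value is a deliberate outlier.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

# Brute-force search over candidate constants 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]
best_squared = min(candidates, key=lambda c: total_squared_error(data, c))
best_absolute = min(candidates, key=lambda c: total_absolute_error(data, c))

print(best_squared, mean(data))     # both 22.0: pulled toward the outlier
print(best_absolute, median(data))  # both 3.0: largely ignores the outlier
```

Neither answer is "right" on its own. Which one you want depends on whether large errors should dominate the fit in your problem.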

Raymond