Mathematical proof for the cost function

Hi everyone!

In the week one study material, we discussed using the cost function (the squared error cost function) to evaluate parameter choices for linear regression. I am curious if anyone can help me understand why the model fits the data best when the cost is closest to zero. Where can we find a mathematical proof of this?


The squared error cost function is literally the square of the Euclidean distance between the correct answer (provided by the labeled data) and the model's prediction. So the closer the prediction is to the correct answer, the lower the distance (cost) will be, right? It follows directly from the definition of distance.
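To make that concrete, here is a small sketch (my own example, not from the course materials) of the squared error cost for a one-variable linear model, averaged over the training examples:

```python
import numpy as np

def squared_error_cost(x, y, w, b):
    """Mean squared error of the linear model y_hat = w * x + b."""
    predictions = w * x + b      # model's predictions
    errors = predictions - y     # signed distance from the labels
    return np.mean(errors ** 2)  # mean squared (Euclidean) distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])    # labels generated by y = 2x exactly

print(squared_error_cost(x, y, w=2.0, b=0.0))  # perfect fit -> cost is 0.0
print(squared_error_cost(x, y, w=1.5, b=0.0))  # worse fit -> cost > 0
```

The perfect-fit parameters give a cost of exactly zero, and any other parameters give a positive cost, which is the sense in which "cost closest to zero" means "best fit".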

The overall point here is that linear regression is predicting a numeric value (the price of a house, the temperature, a stock price …), so the usual notion of distance is the ideal cost function. Pretty soon we will learn about a different type of system in which we are trying to predict a “classification”, as opposed to a numeric value. For example “this picture contains a cat” or “this picture does not contain a cat”. When implementing a “classifier” is our goal, the notion of “distance” becomes a bit more complicated. Please stay tuned to hear what Prof Ng has to say about how we can deal with that case!

If the question is “Why do we square the error?”, it’s because we want the error to always be positive since it represents the distance from the correct answer.

I had the same question and wondered "Why not just use the absolute value of the error instead?", and a professor pointed out that the derivative of the absolute value function |x| is not continuous. You can see here derivative of |x| - Wolfram|Alpha that it jumps from -1 to 1 right when x passes 0.

The derivative of x^2 on the other hand is continuous: derivative of x^2 - Wolfram|Alpha . I don’t know the exact reason this is important, but I have a feeling it relates to our ability to apply gradient descent.

(edit: The derivative of |x| is also completely flat on both ends of the jump, which might be what makes it impossible to use for descent)
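A quick numerical sketch of the point above (my own illustration): the derivative of |x| is ±1 no matter how close you get to 0, while the derivative of x^2, which is 2x, shrinks smoothly toward 0:

```python
import numpy as np

# Sample points straddling zero, progressively closer to it.
xs = np.array([-0.1, -0.001, 0.001, 0.1])

abs_grad = np.sign(xs)  # d/dx |x| = sign(x) away from 0: stays at -1 or +1
sq_grad = 2 * xs        # d/dx x^2 = 2x: passes through 0 continuously

print(abs_grad)  # jumps straight from -1 to +1
print(sq_grad)   # shrinks smoothly toward 0 near x = 0
```

This is also why the squared error gives gradient descent smaller and smaller steps as it approaches the minimum, instead of a constant-magnitude push.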


The answer there is a little more complicated than just positivity. Note that the usual definition of Euclidean Distance is also positive:

D(x, y) = \displaystyle\sqrt{\sum_{i = 1}^{n}(x_i - y_i)^2}

That’s the positive square root by definition, because you always want “distance” to be positive. But the whole point here is that we care about the derivative of the cost function, because we are going to be doing gradient descent in order to minimize the cost (error) when we train the model. If you know some calculus, taking the derivative of a square root is a bit of a mess, right? You apply the normal “derivative of an exponent” rule with the exponent being \frac{1}{2}, so things continue to get messier from there.

But suppose we define the cost as the square of the Euclidean Distance? Then taking the derivative of x^2 is simple. And it gives us a result that is extremely simple and cheap to compute. In fact, we can define the cost to be \frac {1}{2} the square of the distance and then the derivative is even simpler and cleaner.
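To spell that out (using the course's \frac{1}{2m} convention and assuming the notation f_{w,b}(x) = wx + b):

J(w, b) = \displaystyle\frac{1}{2m}\sum_{i = 1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2

\displaystyle\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i = 1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}

The 2 brought down by the power rule cancels the \frac{1}{2}, which is exactly why that factor is included: the gradient comes out with no stray constants.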

Also note that square roots are actually very expensive to compute compared to squaring something. And if you use the square root as the cost, then the derivative also involves a square root (well, the multiplicative inverse of a square root).

So using the square of the distance is a win and of course you get the same solution at lower computational cost, because the same parameters that minimize J also minimize \sqrt{J}. In other words, you’ll end up converging to the same solution with either cost function, so why not use the one that’s simpler and cheaper to compute?

There are cases in which people use the absolute value of the difference as the loss function, but it has a couple of disadvantages:

  1. It is not differentiable at x = 0, as you pointed out. In practice this isn't as big a problem as you might think from a "pure math" standpoint: you just define the derivative at 0 to be either -1 or 1, run gradient descent, and it still converges.

  2. You also mentioned the other disadvantage: because the derivatives are either -1 or 1, the gradient gives the same push whether the error is small or large. With the quadratic loss, the gradient is proportional to the error, so small errors get small corrections and large errors get really big ones. So the "flatness" of the derivatives in the absolute value case doesn't stop gradient descent from converging, but in most situations the squared error cost function gives better results.
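A small sketch of point 2 (my own illustration, not course code): the absolute loss grows linearly with the error and its gradient has constant magnitude, while the squared loss grows quadratically and its gradient scales with the error:

```python
import numpy as np

errors = np.array([0.1, 1.0, 10.0])  # a small, medium, and large error

abs_loss = np.abs(errors)  # linear growth: 0.1, 1.0, 10.0
sq_loss = errors ** 2      # quadratic growth: 0.01, 1.0, 100.0

abs_grad = np.sign(errors)  # constant magnitude: always +/- 1
sq_grad = 2 * errors        # proportional to the error: 0.2, 2.0, 20.0

print(abs_loss, sq_loss)
print(abs_grad, sq_grad)
```

A 100x larger error gets a 10,000x larger squared-loss penalty but only a 100x larger absolute-loss penalty, which is why the squared error pushes the model much harder to fix its big mistakes.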