In the lecture on the cost function formula, instead of taking the average of the errors, we take the average of the errors divided by two. Why is that? What do we gain by dividing by 2?

I think it is to make the formula for the gradient look a bit nicer; you'll see why a few videos later!

Agree with @Stuart_Fong1. It's a matter of choice between 2m, 20m, or just m. For example, if we change from 2m to 20m, the gradient ends up with a coefficient of \frac{1}{10m} instead of \frac{1}{m}, and with a ten-times-larger learning rate \alpha you can expect the same training result as before the change.

Note that our update formula (for w) is w := w - \alpha \times \frac{\partial{J}}{\partial{w}}

You can make such changes in week 1’s last optional lab for gradient descent and see how it works!
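If you'd rather not modify the lab, here is a minimal NumPy sketch of the same experiment (the toy data and the model f(x) = w\,x + b are made up for illustration): gradient descent on a cost scaled by \frac{1}{2m} with learning rate \alpha, versus a cost scaled by \frac{1}{20m} with learning rate 10\alpha, lands on the same parameters.

```python
import numpy as np

# Toy data for a one-feature linear model f(x) = w*x + b (made-up numbers).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])
m = len(x)

def gradients(w, b, scale):
    """Gradients of J = (1/scale) * sum((w*x + b - y)^2)."""
    err = w * x + b - y
    dw = (2.0 / scale) * np.sum(err * x)
    db = (2.0 / scale) * np.sum(err)
    return dw, db

alpha = 0.01
w1 = b1 = 0.0   # cost scaled by 1/(2m), learning rate alpha
w2 = b2 = 0.0   # cost scaled by 1/(20m), learning rate 10*alpha
for _ in range(100):
    dw, db = gradients(w1, b1, 2 * m)
    w1, b1 = w1 - alpha * dw, b1 - alpha * db
    dw, db = gradients(w2, b2, 20 * m)
    w2, b2 = w2 - 10 * alpha * dw, b2 - 10 * alpha * db

# Both runs follow the same trajectory (up to floating-point rounding).
print(np.isclose(w1, w2), np.isclose(b1, b2))
```

The two updates are identical because 10\alpha \times \frac{1}{10m} = \alpha \times \frac{1}{m}, which is the point about the scale factor being absorbed into the learning rate.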

Though it’s a matter of choice, it’s also the convention to use 2m.

Yes, I understand that it is a convention, and in the end the thing that matters most is the error term, but why is it that way? A convention for what? What does it simplify? A derivative? Do the 2's cancel out or something? Thanks for your answers.

Exactly, the 2’s will be cancelled out after taking the derivative.

The error function is a square, so its derivative has a factor of 2 that cancels with the 1/2. At least this is how I understood it. So it is not necessary, just convenient.

The derivative of x^2 is 2x, so if you start from \frac{1}{2}x^2, taking the derivative leaves just x.
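Spelling that out with the course's notation (m examples, model f_{w,b}, and the squared-error cost from the lecture), the 2 from the power rule cancels the \frac{1}{2}:

J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2

\frac{\partial J}{\partial w} = \frac{1}{2m} \sum_{i=1}^{m} 2 \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}

So the gradient is just a plain average of the error terms times x^{(i)}, with no stray factor of 2.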

Hey, I found this answer on stack exchange.

It says that:

**We are dividing 1 by 2m instead of m so that the cost function doesn’t depend upon the number of training examples, this helps us in better comparison.**

Hello @Param302

I am not sure if that answer is accurate.

Dividing by m is so that the cost function does not depend on the number of samples. Further dividing by 2 helps to cancel out the 2 in the numerator, which appears when we take the derivative of Error^2.

Right. It is just a convenient simplification. Of course the minimal solution you get will be the same either way: the parameters that minimize J also minimize 2J.

This is very interesting. We already understood the 2, and now we also know the rationale for the m. Thanks!