I didn't understand why dividing by 2m is better

Levent · June 16, 2022, 1:11am

In the lecture on the cost function formula, the cost is not the average of the squared errors but that average divided by two. Why is that? What do we gain by dividing by 2?

Stuart_Fong1 · June 16, 2022, 1:34am

I think it is to make the formula for the gradient look a bit nicer. You’ll see it a few videos later!

rmwkwok · June 16, 2022, 2:00am

Agree with @Stuart_Fong1. Whether to use 2m, 20m, or just m is a matter of choice. For example, if we change from 2m to 20m, the gradient ends up with a coefficient of \frac{1}{10m} instead of \frac{1}{m}, and with a ten times larger learning rate \alpha you can expect the same training result as before the change.

Note that our update formula (for w) is w := w - \alpha \times \frac{\partial{J}}{\partial{w}}

You can make such changes in week 1’s last optional lab for gradient descent and see how it works!

Though it’s a matter of choice, it’s also the convention to use 2m.
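The scaling argument above can be checked numerically. The following is a minimal sketch, not the optional lab's code: the function name `gradient_descent`, the parameter `denom_factor`, and the toy data are all assumptions made up for illustration. It runs gradient descent once with a 2m denominator and once with a 20m denominator and a ten times larger \alpha.

```python
import numpy as np

def gradient_descent(x, y, denom_factor, alpha, steps=5000):
    """Fit f(x) = w*x + b by minimizing J = sum((f - y)**2) / (denom_factor * m).

    denom_factor is a hypothetical knob: 2 gives the usual 1/(2m) cost,
    20 gives a 1/(20m) cost.
    """
    m = len(x)
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = w * x + b - y
        # Differentiating err**2 brings down a factor of 2,
        # so the gradient carries the coefficient 2 / (denom_factor * m).
        dw = 2.0 / (denom_factor * m) * np.sum(err * x)
        db = 2.0 / (denom_factor * m) * np.sum(err)
        w -= alpha * dw
        b -= alpha * db
    return w, b

# Toy data (assumed for illustration), generated from y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

w_2m, b_2m = gradient_descent(x, y, denom_factor=2, alpha=0.05)
w_20m, b_20m = gradient_descent(x, y, denom_factor=20, alpha=0.5)

print("2m  cost:", w_2m, b_2m)
print("20m cost:", w_20m, b_20m)
```

Both runs take the same effective step, \alpha times the gradient coefficient, so the two printed parameter pairs should agree up to floating-point rounding.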

Levent · June 16, 2022, 2:25am

Yes, I understand that it is a convention, and in the end what matters most is the error term. But why is it that way? A convention for what? What does it simplify? A derivative? Will the 2’s cancel or something? Thanks for your answers.

rmwkwok · June 16, 2022, 2:34am

Exactly, the 2’s will be cancelled out after taking the derivative.

Markus_Degen · July 12, 2022, 1:42pm

The error function is a square, so the derivative has a multiplier of 2, which cancels with the 1/2. At least this is how I understood it. So it is not necessary, but convenient.

The derivative of x^2 is 2x, so if you start from \frac{1}{2}x^2 instead, the derivative is just x.
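Written out for the cost function, the cancellation might look like the following. The notation is an assumption (the thread does not spell it out): a linear model f_{w,b}(x) = wx + b with training examples (x^{(i)}, y^{(i)}), i = 1, ..., m.

```latex
J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2,
\qquad f_{w,b}(x) = wx + b

\frac{\partial J}{\partial w}
  = \frac{1}{2m} \sum_{i=1}^{m} 2 \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}
  = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}
```

With a plain \frac{1}{m} in the cost, the same derivative would instead carry a leftover factor of \frac{2}{m}.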

Param302 · July 17, 2022, 4:44pm

Hey, I found this answer on stack exchange.
It says that:

We are dividing 1 by 2m instead of m so that the cost function doesn’t depend upon the number of training examples, this helps us in better comparison.

shanup · July 17, 2022, 10:23pm

Hello @Param302

I am not sure if that answer is accurate.

Dividing by m is so that the cost function does not depend on the number of samples. Further dividing by 2 helps to cancel out the 2 in the numerator, which appears when we take the derivative of Error^2.

paulinpaloalto · July 17, 2022, 10:26pm

Right. It is just a convenient simplification. Of course the minimal solution you get will be the same either way: the parameters that minimize J also minimize 2J.
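A quick way to see that the minimizer does not move is to evaluate both versions of the cost on a grid of candidate parameters. This is only a sketch under assumptions: the toy data, the grid `w_grid`, and the one-parameter model f(x) = wx (bias omitted for brevity) are made up for illustration.

```python
import numpy as np

# Toy data (assumed for illustration), generated from y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
m = len(x)

# Candidate values of w, and the summed squared error for each one.
w_grid = np.linspace(0.0, 4.0, 401)
sq_err = np.array([np.sum((w * x - y) ** 2) for w in w_grid])

J_m = sq_err / m          # cost with a 1/m factor
J_2m = sq_err / (2 * m)   # cost with a 1/(2m) factor

best_m = w_grid[np.argmin(J_m)]
best_2m = w_grid[np.argmin(J_2m)]

print("argmin with 1/m: ", best_m)
print("argmin with 1/2m:", best_2m)
```

Scaling the cost by a positive constant halves every value on the curve but leaves the location of the lowest point where it was, so both searches should pick the same w.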

Levent · July 19, 2022, 11:21pm

This is very interesting. We already understood the 2, but now we also know the rationale for the m. Thanks!
