Related to the cost function, as explained in the W1 lecture "Cost Function Formula" on Coursera, the version divided by m (1/m) and the version divided by 2m (1/(2m)) can be used interchangeably, with the second being preferred because it simplifies later calculations.
How can these two functions be equivalent when one is always half the value of the other? What other adjustments, if any, need to be made so that they can truly be used interchangeably?
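For concreteness, here are the two forms side by side, written with the notation I believe the course uses (linear model $f_{w,b}(x^{(i)}) = wx^{(i)} + b$ over $m$ training examples), labelled $J_{1/m}$ and $J_{1/2m}$ here just to tell them apart:

$$
J_{1/m}(w,b) = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2
\qquad
J_{1/2m}(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2
$$

so the second is always exactly half of the first.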
So, you’re saying that they can be “used interchangeably” because what really matters is the effect of their results (or the parameters found), not the absolute values?
Because when applied in the gradient descent update (as at 3:25 of [Gradient descent for linear regression | Coursera](https://www.coursera.org/learn/machine-learning/lecture/lgSMj/gradient-descent-for-linear-regression)), the cost function divided by m (1/m) and the one divided by 2m (1/(2m)) give different results. So the key to the question is that these absolute values don't really matter; what matters is their effect on the gradient descent, so to speak?
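If I differentiate both versions (same notation as above), the gradients differ by exactly a factor of 2, which is what I mean by "different results":

$$
\frac{\partial J_{1/m}}{\partial w} = \frac{2}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)},
\qquad
\frac{\partial J_{1/2m}}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}.
$$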
The actual value of the loss itself isn't particularly important. What we care about is the model parameters that produce a minimal loss. Therefore, the two equations can be used interchangeably.
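Put another way, scaling the cost by a positive constant changes its value at every point but not where the minimum sits:

$$
\arg\min_{w,b}\; \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2
\;=\;
\arg\min_{w,b}\; \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2
$$

since $\arg\min_x c\,f(x) = \arg\min_x f(x)$ for any constant $c > 0$.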
Yup, that is correct as well.
We don't really care about the absolute value of the loss. What we care about is that 1) the model has parameters that result in a minimal loss (or as close to minimal as we can make it), and 2) the loss decreases as you train it.
For example, you might have a model A that claims to have a loss of 1000, while another completely different model B claims to have a loss of 10. This doesn’t mean model B is better than model A.
In fact, you can’t tell which model is better from just this information, and you’d need to evaluate the models using a test/validation set to know which is better.
The division by 2 shouldn't be an issue for the gradients, either. It simply scales the gradients by a constant factor, and that can easily be offset by the learning rate.
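Here is a small numerical sketch of that last point (a toy dataset I made up, not from the course): gradient descent on the 1/m cost with learning rate alpha takes exactly the same steps as gradient descent on the 1/(2m) cost with learning rate 2*alpha.

```python
import numpy as np

# Toy 1-D linear regression data (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])
m = len(x)

def grads(w, b, scale):
    """Gradients of scale * (1/m) * sum((w*x + b - y)^2) w.r.t. w and b."""
    err = w * x + b - y
    dw = scale * 2.0 * np.dot(err, x) / m
    db = scale * 2.0 * np.mean(err)
    return dw, db

def descend(alpha, scale, steps=10000):
    """Plain batch gradient descent from (w, b) = (0, 0)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = grads(w, b, scale)
        w, b = w - alpha * dw, b - alpha * db
    return w, b

# 1/m cost with alpha, versus 1/(2m) cost with 2*alpha: identical updates.
print(descend(alpha=0.01, scale=1.0))   # J = (1/m)  * sum(squared errors)
print(descend(alpha=0.02, scale=0.5))   # J = (1/2m) * sum(squared errors)
```

Both runs print the same fitted (w, b), which is the sense in which the two cost functions are interchangeable.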