Related to the cost function, as explained in the W1 lecture "Cost Function Formula" on Coursera, the version divided by m (1/m) and the version divided by 2m (1/(2m)) can be used interchangeably, with the second being preferred because it simplifies later calculations.
How can these two functions be equivalent when one is always half the value of the other? What other adjustments, if any, need to be made so that they can truly be used interchangeably?
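For concreteness, here are the two forms side by side, written with the notation I believe the course uses (linear model $f_{w,b}(x^{(i)}) = wx^{(i)} + b$ over $m$ training examples), labelled $J_{1/m}$ and $J_{1/2m}$ here just to tell them apart:

$$
J_{1/m}(w,b) = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2
\qquad
J_{1/2m}(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2
$$

so the second is always exactly half of the first.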
So, you’re saying that they can be “used interchangeably” because what really matters is the effect of their results (or the parameters found), not the absolute values?
Because when applied in the gradient descent update (as at 3:25 of [Gradient descent for linear regression | Coursera](https://www.coursera.org/learn/machine-learning/lecture/lgSMj/gradient-descent-for-linear-regression)), the cost function divided by m (1/m) and the one divided by 2m (1/(2m)) give different results. So the key to the question is that these absolute values don't really matter; what matters is their effect on the gradient descent, so to speak?
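If I differentiate both versions (same notation as above), the gradients differ by exactly a factor of 2, which is what I mean by "different results":

$$
\frac{\partial J_{1/m}}{\partial w} = \frac{2}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)},
\qquad
\frac{\partial J_{1/2m}}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}.
$$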
The actual value of the loss itself isn't particularly important. What we care about is the model parameters that produce a minimal loss. Therefore, the two equations can be used interchangeably.
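Put another way, scaling the cost by a positive constant changes its value at every point but not where the minimum sits:

$$
\arg\min_{w,b}\; \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2
\;=\;
\arg\min_{w,b}\; \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2
$$

since $\arg\min_x c\,f(x) = \arg\min_x f(x)$ for any constant $c > 0$.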
Yup, that is correct as well.
We don't really care about the absolute value of the loss. What we care about is that 1) the model has parameters that result in a minimal loss (or as close to minimal as we can make it), and 2) the loss decreases as you train it.
For example, you might have a model A that claims to have a loss of 1000, while another completely different model B claims to have a loss of 10. This doesn’t mean model B is better than model A.
In fact, you can’t tell which model is better from just this information, and you’d need to evaluate the models using a test/validation set to know which is better.
The division by 2 shouldn't be an issue for the gradients, either. It simply scales the gradients by a constant factor, and that can easily be offset by the learning rate.
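Here is a small numerical sketch of that last point (a toy dataset I made up, not from the course): gradient descent on the 1/m cost with learning rate alpha takes exactly the same steps as gradient descent on the 1/(2m) cost with learning rate 2*alpha.

```python
import numpy as np

# Toy 1-D linear regression data (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])
m = len(x)

def grads(w, b, scale):
    """Gradients of scale * (1/m) * sum((w*x + b - y)^2) w.r.t. w and b."""
    err = w * x + b - y
    dw = scale * 2.0 * np.dot(err, x) / m
    db = scale * 2.0 * np.mean(err)
    return dw, db

def descend(alpha, scale, steps=10000):
    """Plain batch gradient descent from (w, b) = (0, 0)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = grads(w, b, scale)
        w, b = w - alpha * dw, b - alpha * db
    return w, b

# 1/m cost with alpha, versus 1/(2m) cost with 2*alpha: identical updates.
print(descend(alpha=0.01, scale=1.0))   # J = (1/m)  * sum(squared errors)
print(descend(alpha=0.02, scale=0.5))   # J = (1/2m) * sum(squared errors)
```

Both runs print the same fitted (w, b), which is the sense in which the two cost functions are interchangeable.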