So Andrew sir first introduced m_j and then removed this term, with the message "Even if we remove it, the weights will converge to the same values, because m_j is just a constant."
This doesn't make sense to me. Basically, we divide by the number of elements used in the average. The loss function is MSE, so if we remove the division by m_j it will no longer be the mean but simply the sum of squared errors. And the \frac{1}{2} is there to prevent upscaling when the 2 appears after differentiation.
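For reference, I believe the cost function in question has this form (my notation may differ slightly from the video):

$$
J(\vec{w}, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2
$$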
That term isn’t the average of the squares of the weights.
Division by m is correct.
The concept is that the larger the data set, the less regularization is needed.
My question is: why was m_j first used in the loss function and later removed, with the explanation "adding it or removing it doesn't affect the weights learned"?
I have not checked the video, but in any problem involving the minimisation of a loss function, the point at which the minimum occurs stays the same if the whole loss is multiplied by a positive constant.
As a simple example, imagine the loss being L = A*(x - 2) ** 2, with A > 0. The minimum is at x = 2, regardless of the value of A.
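A quick numerical check of this (just a sketch, using scipy to minimise the toy loss):

```python
# Minimising A*(x - 2)**2 for several positive values of A.
# The minimiser is x = 2 in every case; only the minimum value changes.
from scipy.optimize import minimize_scalar

for A in (0.5, 1.0, 10.0, 100.0):
    result = minimize_scalar(lambda x, A=A: A * (x - 2) ** 2)
    print(f"A = {A:6.1f}  ->  argmin x = {result.x:.6f}")
```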
In the case of the equation you are showing, as long as the constant multiplies both the MSE part and the L2 part, you can remove it. If you were to add a regularisation term that does not have the m(j) factor, then you could not remove that constant. So I can see why it is important to show it, but then cross it out for this particular case.
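To make that concrete, here is a small sketch (my own example, not from the course) using the closed-form ridge solution: dropping a factor that multiplies the whole cost leaves the minimiser untouched, while a regularisation term that is not divided by m effectively becomes m times stronger.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 50, 3, 1.0
X = rng.normal(size=(m, n))
y = X @ np.array([1.5, -2.0, 0.7]) + 0.1 * rng.normal(size=m)

def ridge_minimiser(alpha):
    # Minimiser of ||Xw - y||^2 + alpha * ||w||^2 (closed form).
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)

# Cost = (1/(2m)) * [sum of squared errors + lam * ||w||^2]
# Removing the common 1/(2m) factor gives exactly the same minimiser.
w_common = ridge_minimiser(lam)

# Cost = (1/(2m)) * sum of squared errors + (lam/2) * ||w||^2
# Here only the error term carries the 1/m, so the L2 term is m times
# stronger relative to it, and the minimiser changes.
w_mismatch = ridge_minimiser(m * lam)

print("1/(2m) on both parts (or removed):", w_common)
print("1/(2m) on the MSE part only      :", w_mismatch)
```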
Similarly, when we see \frac{1}{2} multiplied into the linear regression cost, it is just there to cancel out the 2 that appears after differentiating the MSE. I got it, thank you @isaac.casm.
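For anyone else reading, that cancellation looks like this for a single weight w_j (assuming the linear model f_{\vec{w},b}(\vec{x}) = \vec{w}\cdot\vec{x} + b):

$$
\frac{\partial}{\partial w_j}\left[\frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2\right]
= \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}
$$

so the factor of 2 from differentiating the square cancels the \frac{1}{2}.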