Why exactly $m_j$ is removed from the loss function

So Andrew sir first introduced $m_j$ and then removed the term, with the message: "Even if we remove it, the weights will converge to the same values, because $m_j$ is just a constant."

This doesn't make sense to me. We divide by the number of elements used in the average: the loss function is the MSE, so if we remove the division by $m_j$, it is no longer the mean but simply the squared error. And the $\frac{1}{2}$ is there to cancel the 2 that appears on differentiation.

In the course assignment I see it is called the squared error. So I believe the term was added in the lecture by mistake and then later removed.

That term isn’t the average of the squares of the weights.
Division by m is correct.
The concept is that the larger the data set, the less regularization is needed.

That’s not an error. In this method the cost is based on the squared error.

I'm not sure if I'm getting this right, but I am talking about this part only, not about regularization.


The numerator there is m.

Do you mean the denominator? Or did I misunderstand?

Please state your question clearly. I cannot tell from your images which part of the equation you’re asking about.

Can you indicate your question in red ink?

My question is: why was $m_j$ first used in the loss function and then removed, with the explanation "adding or removing it doesn't affect the weights learned"?

Sorry, I do not understand your question.

Can you give a specific link or video timestamp where this is stated?

Watch this video after 8:30

I have not checked the video, but in any problem involving the minimisation of a loss function, the point at which the minimum is attained stays the same if the loss is multiplied by a positive constant; only the minimum value itself is rescaled.

As a simple example, imagine the loss being L = A*(x - 2) ** 2, with A > 0. The minimum is at x = 2, regardless of the value of A.
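A quick numerical check of this (my own sketch, not from the lecture): running gradient descent on that loss with several values of A lands at the same minimiser; A only rescales the gradient, which acts like a change of step size.

```python
# Minimise L(x) = A * (x - 2)**2 by gradient descent for several A > 0.
# dL/dx = 2 * A * (x - 2), so A only rescales the gradient; the
# minimiser x = 2 is unchanged.
def minimise(A, lr=0.01, steps=5000):
    x = 0.0
    for _ in range(steps):
        grad = 2 * A * (x - 2)
        x -= lr * grad
    return x

for A in (0.5, 1.0, 10.0):
    print(A, minimise(A))  # each run ends at x ≈ 2
```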

In the case of the equation you are showing, as long as the constant multiplies both the MSE part and the L2 part, you can remove it. If you were to add a regularisation term that does not carry the $m_j$ factor, then you could not remove it from one part only. So I can see why it is worth showing the term, but then crossing it off in this particular case.

Hopefully, that makes it a bit clearer


@isaac.casm, thanks for the reply.

If m is a constant, then it is only a scaling factor on the magnitude of the cost. It has no effect on the weight values that give the minimum cost.
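As a concrete sketch (my own example, not from the lecture): fit a one-parameter linear model with the cost written as mean squared error (with the 1/m) and as plain summed squared error (without it). Gradient descent reaches the same weight in both cases; only the learning rate needs rescaling, because the gradient magnitude differs by the factor m.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * x + rng.normal(scale=0.1, size=50)  # true weight is 3
m = len(x)

def fit(scale, lr, steps=2000):
    # scale = 1/m gives the MSE cost; scale = 1 gives plain squared error.
    # Gradient of (scale / 2) * sum((w*x - y)**2) with respect to w:
    w = 0.0
    for _ in range(steps):
        grad = scale * np.sum((w * x - y) * x)
        w -= lr * grad
    return w

w_mse = fit(scale=1.0 / m, lr=0.1)   # cost averaged over m
w_sse = fit(scale=1.0, lr=0.1 / m)   # cost not averaged, lr rescaled
print(w_mse, w_sse)  # both land on the same weight, close to 3
```

The two runs take mathematically identical steps, since only the product lr * scale enters the update; that is exactly why dropping the constant changes nothing about the learned weights.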


Similarly, the $\frac{1}{2}$ multiplied into the linear regression cost is there just to cancel the 2 that appears when the MSE is differentiated for gradient descent. I got it, thank you @isaac.casm
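That cancellation can be checked numerically too (a small sketch of my own): the analytic derivative of $\frac{1}{2}(wx - y)^2$ with respect to $w$ is $(wx - y)x$, with no stray factor of 2, and it matches a finite-difference estimate.

```python
def half_sq_loss(w, x, y):
    return 0.5 * (w * x - y) ** 2

def analytic_grad(w, x, y):
    # d/dw of 0.5*(w*x - y)**2 = (w*x - y) * x; the 1/2 cancels the 2.
    return (w * x - y) * x

w, x, y = 1.5, 2.0, 1.0
eps = 1e-6
numeric = (half_sq_loss(w + eps, x, y) - half_sq_loss(w - eps, x, y)) / (2 * eps)
print(analytic_grad(w, x, y), numeric)  # both ≈ 4.0
```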