So Andrew sir first introduced m_j and then removed this term with the message “Even if we remove it, the weights will converge to the same values, because m_j is just a constant.”

This doesn't make sense to me. We divide by the number of elements used in the average. The loss function is MSE, so if we remove the division by m_j it is no longer the mean but simply the sum of squared errors. And the \frac{1}{2} is there to cancel the 2 that appears on differentiation.

That term isn’t the average of the squares of the weights.
Division by m is correct.
The concept is that the larger the data set, the less regularization is needed.

My question is: why was m_j first used in the loss function and later removed, with the explanation “add it or remove it, it doesn't affect the weights learned”?

I have not checked the video, but in any problem involving the minimisation of a loss function, the minimiser (the parameter values at which the minimum is reached) stays the same if the loss is multiplied by a positive constant. Only the value of the loss at that point is rescaled.

As a simple example, imagine the loss being L = A*(x - 2) ** 2, with A > 0. The minimum is at x = 2, regardless of the value of A.
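To make that concrete, here is a minimal sketch that searches for the minimiser of the same toy loss on a grid for a few values of A. The grid range and the helper name `argmin_on_grid` are just illustrative choices, not anything from the course.

```python
def loss(x, A):
    # Toy loss from the example above: L = A * (x - 2) ** 2
    return A * (x - 2) ** 2

# Candidate x values from -5.00 to 5.00 in steps of 0.01 (arbitrary range).
grid = [i / 100 for i in range(-500, 501)]

def argmin_on_grid(A):
    # Return the grid point with the smallest loss for this value of A.
    return min(grid, key=lambda x: loss(x, A))

minimisers = {A: argmin_on_grid(A) for A in (0.5, 1.0, 10.0)}
print(minimisers)  # {0.5: 2.0, 1.0: 2.0, 10.0: 2.0}
```

The location of the minimum does not move. What A does change is the size of the gradient, so with gradient descent you may need a different learning rate.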

In the case of the equation you are showing, as long as the 1/m_j factor multiplies both the MSE part and the L2 part, you can remove it. If you were to add a regularisation term that does not carry the 1/m_j factor, then you could not remove it. So I can see why it is important to show it, but then cross it off for this particular case.
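A small numerical check of this point, as a sketch only: I am assuming a plain linear model without a bias term, synthetic data, and a cost of the form data_scale/2 * ||Xw - y||^2 + reg_scale * lam/2 * ||w||^2, solved in closed form. The names `ridge_minimiser`, `data_scale` and `reg_scale` are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                      # m plays the role of m_j here
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)
lam = 1.0

def ridge_minimiser(data_scale, reg_scale):
    # Minimises data_scale/2 * ||Xw - y||^2 + reg_scale*lam/2 * ||w||^2.
    # Setting the gradient to zero gives a linear system in w.
    A = data_scale * X.T @ X + reg_scale * lam * np.eye(n)
    return np.linalg.solve(A, data_scale * X.T @ y)

w_with_m = ridge_minimiser(1 / m, 1 / m)   # 1/m on both terms
w_without_m = ridge_minimiser(1.0, 1.0)    # 1/m removed from both terms
w_mixed = ridge_minimiser(1 / m, 1.0)      # 1/m on the MSE term only

print(np.allclose(w_with_m, w_without_m))  # True: same weights
print(np.allclose(w_with_m, w_mixed))      # False: regularisation is now m times stronger
```

In the mixed case the common factor no longer cancels, which is the situation where the m_j term cannot simply be crossed off.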

Similarly, the \frac{1}{2} in the linear regression cost is there just to cancel the 2 that comes from differentiating the squared error. I got it, thank you @isaac.casm
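For reference, the cancellation written out, assuming the usual linear regression cost with a model f_w(x) and m training examples (this notation is my assumption, not taken from the slide):

$$
J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_w(x^{(i)}) - y^{(i)}\right)^2
\quad\Rightarrow\quad
\frac{\partial J}{\partial w_k} = \frac{1}{m}\sum_{i=1}^{m}\left(f_w(x^{(i)}) - y^{(i)}\right)\frac{\partial f_w(x^{(i)})}{\partial w_k}
$$

The 2 from the power rule cancels the \frac{1}{2}, and like m_j it is a positive constant, so it does not change where the minimum is.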