In the video "Why Regularization Reduces Overfitting?", the teacher mentions many times that if lambd is big, the w values will get close to 0. Can someone explain to me why, in terms of the new cost function J? I want to understand this from the equation.
My understanding so far: if lambd is big, the new regularization term will be big, and then J is big. Since we want J to be small, we want the new reg term to be small as well. So if lambd is big, the w values will go small?? I am not sure if I truly understand it or not…
The regularization term comes into effect during the backward pass, when the weights are updated using the gradients. The additional regularization term pushes the weights toward smaller values.
Please check the 2nd assignment for the week. You’ll implement regularization from scratch.
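To make that concrete, here is a minimal sketch of how the L2 term shows up in the weight update. This is not the assignment's actual code; the function and argument names (`update_weights`, `dW_from_loss`) are just made up for illustration, and it handles a single weight matrix rather than a full multi-layer network:

```python
import numpy as np

# Simplified sketch: with L2 regularization, the gradient of J w.r.t. W picks up
# an extra (lambd / m) * W term, so every gradient step shrinks W a little more
# than plain gradient descent would ("weight decay").
def update_weights(W, dW_from_loss, lambd, m, learning_rate):
    dW = dW_from_loss + (lambd / m) * W   # gradient of the L2 penalty added to the loss gradient
    return W - learning_rate * dW

# Toy usage: even if the loss gradient were zero, the L2 term alone
# pulls every entry of W slightly toward 0 on each update.
W = np.array([[0.5, -1.2], [0.3, 0.8]])
dW = np.zeros_like(W)              # pretend the loss gradient is zero
W_new = update_weights(W, dW, lambd=0.7, m=10, learning_rate=0.1)
print(W_new)                       # each entry is scaled by 0.993, i.e. shrunk toward 0
```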
Thanks for the reply, but that's not what I am asking. I am asking why; I wish to understand it from the math side.
Look at what the L2 regularization term is: it's the sum of the squares of all the individual elements of all the W matrices, times a constant based on \lambda. So how do you minimize that? By making the absolute values of all those elements as small as possible, right? Well, there is an absolute minimum there: set them all to zero. But then the model becomes trivial, meaning that it ignores the input data and always makes the same prediction on any input.

So the important question is what value to choose for \lambda: if you make it large, then the regularization term dominates J and all it does is push the W values to zero. Of course the point is that J is now the sum of two terms, and you want a balance between the "real" cost based on cross entropy loss (which actually measures the accuracy of the predictions of the model) and the regularization term. You want to suppress the weights somewhat to reduce overfitting, but not too much, or the weights all get pushed to 0 and the model becomes trivial.
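To write it out explicitly, in the notation used in the lectures the regularized cost is (the first term is the usual cross-entropy cost, the second is the L2 term described above):

$$
J_{regularized} = \underbrace{-\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log a^{[L](i)} + \left(1 - y^{(i)}\right)\log\left(1 - a^{[L](i)}\right)\right)}_{\text{cross-entropy cost}} \;+\; \underbrace{\frac{\lambda}{2m}\sum_{l}\left\|W^{[l]}\right\|_F^2}_{\text{L2 regularization term}}
$$

Making \lambda larger scales up only the second term, so minimizing J then puts more pressure on shrinking the entries of each W^{[l]}.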
I think I follow. So, I want to double-confirm:
Since we want J to be small, we want the new reg term to be small as well. If lambd is big, then the w values will go small because of the sum of squares. Am I correct?
Yes, that’s correct. The point is that with L2 regularization, there are now two terms in the cost J:
- The normal cross entropy loss, which measures the actual accuracy of the predictions of the model.
- The L2 regularization term.
If you make the \lambda value too large, then the L2 term is going to dominate the cost and just drive the W values close to 0. In the limit, they all become 0, which, as I pointed out in my previous response, makes the model useless.
So you need both terms in J to play their intended role, which requires that you tune the value of \lambda appropriately. Prof Ng spends lots of time in the lectures in Week 1 and Week 2 discussing how to tune hyperparameters in a systematic way.
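If it helps to see numbers, here is a toy sketch (not the assignment's code; the helper name `l2_cost` and the weight values are made up for illustration) showing how \lambda changes the balance between the two terms:

```python
import numpy as np

# Regularized cost = cross-entropy cost + (lambd / (2*m)) * sum of squared weights.
# Returns the two terms and their sum so you can see which one dominates.
def l2_cost(cross_entropy_cost, weight_matrices, lambd, m):
    l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weight_matrices)
    return cross_entropy_cost, l2_term, cross_entropy_cost + l2_term

W1 = np.array([[0.5, -1.2], [0.3, 0.8]])
W2 = np.array([[1.1, -0.4]])
m = 100
ce = 0.3   # pretend cross-entropy cost for these same weights

print(l2_cost(ce, [W1, W2], lambd=0.1, m=m))    # L2 term ~0.002: barely affects J
print(l2_cost(ce, [W1, W2], lambd=100.0, m=m))  # L2 term ~1.9: dominates J
```

With the large \lambda, gradient descent can lower J far more by shrinking the W's than by improving the cross-entropy term, which is exactly why the weights get driven toward 0.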
Thank you, thank you, Paul! You are always so helpful!