I understand that L2 will be penalizing bigger coefficients. Multiplying delta with bigger numbers will result in a bigger error. However, where are we checking for this big error?
In other words, I understand that we are ballooning the cost function whenever we have big coefficients. But how and when are we penalizing these big numbers? What are we doing to them? Aren’t we just trying to optimize a bigger cost function?
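The penalty on large weights is applied entirely through the gradients of the cost function. With L2 regularization, we add a term to the cost that includes the sum of the squares of all the weights in all layers. Because the gradient of a sum is the sum of the gradients, the gradients include the derivatives of those squared terms in addition to the “normal” gradients we get from the unregularized cost function. The derivative of each squared weight is proportional to the weight itself, so every gradient descent update pushes each weight toward zero by an amount that grows with its size. So we are not just optimizing a bigger cost function: we are optimizing one whose minimum favors smaller weights.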
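To make that concrete, here is a minimal NumPy sketch. Everything in it is an assumption for illustration rather than the course's actual code: it uses a single-layer linear model with a mean-squared-error cost, the common `lam / (2 * m)` scaling for the penalty, and a hypothetical function name `l2_cost_and_grad`. It shows the gradient splitting into the unregularized part plus a part proportional to the weights.

```python
import numpy as np

def l2_cost_and_grad(W, X, y, lam):
    """Illustrative only: MSE cost plus an L2 penalty, and its gradient w.r.t. W."""
    m = X.shape[0]
    residual = X @ W - y                          # prediction error, shape (m,)

    base_cost = (residual @ residual) / (2 * m)   # unregularized cost
    base_grad = X.T @ residual / m                # its gradient

    reg_cost = lam / (2 * m) * np.sum(W ** 2)     # L2 penalty (assumed scaling)
    reg_grad = lam / m * W                        # its gradient: proportional to W

    # Gradient of the sum is the sum of the gradients.
    return base_cost + reg_cost, base_grad + reg_grad, reg_grad

# Made-up data: 50 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])

W = np.array([10.0, 0.1, -5.0])                   # one large, one small, one medium weight
cost, grad, reg_grad = l2_cost_and_grad(W, X, y, lam=0.7)
print(reg_grad)                                   # largest in magnitude for the largest weight
```

In a gradient descent step `W -= learning_rate * grad`, the `reg_grad` part shrinks each weight in proportion to its current value, which is why big coefficients are pulled down harder than small ones.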