NN Regularization for all W parameters

While calculating the cost function, we want to minimize the error between the output of the output layer and the target variable.
Then why, during regularization, do we regularize all the W in the neural network? Regularizing only the W in the output layer should work just fine, since it is the one associated with the output.

Hi @Thala,

Regularization helps avoid overfitting because the regularization term pushes the values of the weights toward zero, which effectively "disables" them. Usually we have an overfitting problem because our NN is too large (too many layers, or too many nodes per layer). So if we keep adding layers but only ever regularize the output layer, the overfitting problem will only get worse. To counter the added layers, we need regularization in them too.

On the other hand, if you work out the maths on a very simple 2-layer NN and regularize only the output layer, you will find that, because the output layer is regularized, its weights are suppressed, but the weights in the first hidden layer grow to compensate. We are just "shifting" weight from the output layer to the first hidden layer. To counter this "shift", we need regularization in the hidden layer as well.

The above points are why we want to regularize all layers. I really cannot comment on the best strategy, though: if I have 20 layers, which layers should I regularize and which shouldn't I? I cannot give you a general answer to that, but you have the full regularizing power only when you regularize all layers.
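To make the "all W" part concrete, here is a minimal NumPy sketch of a regularized cost: the penalty sums the squared weights of every layer, not just the output layer. The layer shapes, the lambda value, and the stand-in loss value are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10                                   # number of training examples
W1 = rng.normal(size=(4, 3))             # hidden-layer weights (illustrative shape)
W2 = rng.normal(size=(1, 4))             # output-layer weights (illustrative shape)
lam = 0.1                                # regularization strength (made up)

base_loss = 0.5                          # stand-in for the unregularized loss
# The penalty covers BOTH W1 and W2 - every layer contributes.
penalty = (lam / (2 * m)) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
J = base_loss + penalty
```

If you dropped `W1` from the penalty, gradient descent would be free to grow the hidden-layer weights without any cost, which is exactly the "shifting" problem described above.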

At the end of the day, we rely on the cross-validation set's evaluation results to choose the best model (including where to regularize).



Thanks, I understand now.

Just out of curiosity:
Can we obtain the W values for each neuron in a particular layer?
If so, then it should also be possible to get the activation vector produced by each layer's output (which is the next layer's input). Am I right?

Yes! @Thala, for example, in C2 W1 Lab: Coffee Roasting in Tensorflow, you can get the weights by

[Screenshot from 2022-07-15 13-47-29: code retrieving a layer's weights]

You can also create a “sub-model” by using the model’s input as the sub-model’s input, but one of the model’s hidden layer’s output as the sub-model’s output. In this case, your sub-model will be producing the output activation vector of that hidden layer.
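A sketch of that sub-model trick, assuming a functional-API model with a hidden layer named `layer1` (the names and sizes here are illustrative):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense

# Original stand-in model built with the functional API.
inputs = tf.keras.Input(shape=(2,))
hidden = Dense(3, activation='sigmoid', name='layer1')(inputs)
outputs = Dense(1, activation='sigmoid', name='layer2')(hidden)
model = tf.keras.Model(inputs, outputs)

# Sub-model: same input as the original model, but its output is the
# hidden layer's output, so predict() returns that layer's activations.
sub_model = tf.keras.Model(inputs=model.input,
                           outputs=model.get_layer('layer1').output)

a1 = sub_model.predict(np.zeros((1, 2)))
print(a1.shape)   # one activation per hidden neuron, per example
```

Because the sub-model reuses the original model's layers (not copies), it always reflects the current trained weights.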

Example here.