Neural network regularization: What does it mean to regularize a hidden layer?

In lecture, Prof. Ng motivates that regularizing a neural network can help reduce variance even in large neural networks. He then describes how a neural network can be regularized, and presents the cost function:

J(\mathbf{W},\mathbf{B})=\frac{1}{m}\sum_{i=1}^m L(F(x^{(i)}),y^{(i)})+\frac{\lambda}{2m}\sum w^2

He then shows what this looks like in TensorFlow code:

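from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import L2
# Each layer's kernel (weight matrix) gets an L2 penalty with factor 0.01
layer_1 = Dense(units=25, activation='relu', kernel_regularizer=L2(0.01))
layer_2 = Dense(units=15, activation='relu', kernel_regularizer=L2(0.01))
layer_3 = Dense(units=1, activation='sigmoid', kernel_regularizer=L2(0.01))
model = Sequential([layer_1, layer_2, layer_3])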

The TensorFlow code in particular is confusing to me, and makes me wonder if my conception of regularization is accurate.

So far in lectures, we have only seen regularization applied to a cost function, not activation functions or z values. It seems odd that in TensorFlow, the regularization is specified on each layer, rather than in the compile method where we provide a loss function as an argument.

I’m not sure what a regularizer on a hidden layer would even do. Could someone help clarify 1) whether my understanding of regularization as applying to the cost is accurate, and if so, 2) why TensorFlow specifies regularization on each layer?

Thank you!

The TensorFlow code doesn’t show how regularization happens. It just shows that there’s a “lambda” value used for regularizing the weights in each layer.

The essence of regularization is that it adds some additional cost based on the magnitudes of the weight values. This creates an incentive for the weight values to be reduced slightly, while at the same time still trying to make good predictions on the training set.

Overall this causes the predictions to be less good on the training set, but the tradeoff is that you can get better predictions on data that wasn’t in the training set.
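
To make the "additional cost" concrete, here is a minimal sketch in plain Python that follows the lecture's cost formula. The toy weights, the data loss value, and the helper names `l2_penalty` and `regularized_cost` are all made up for illustration; this is not how TensorFlow implements it internally. It shows that a penalty declared per layer still ends up as one number added to the overall cost.

```python
def l2_penalty(weights, lam, m):
    """L2 penalty for one layer: (lambda / (2m)) * sum of squared weights.

    `weights` is a weight matrix given as a list of rows.
    """
    return lam / (2 * m) * sum(w * w for row in weights for w in row)


def regularized_cost(data_loss, layers, lam, m):
    """Average data loss plus the L2 penalty of every layer's weights."""
    return data_loss + sum(l2_penalty(W, lam, m) for W in layers)


# Hypothetical toy values, chosen only so the arithmetic is easy to follow.
W1 = [[0.5, -1.0], [2.0, 0.0]]   # weights of a first layer
W2 = [[1.5], [-0.5]]             # weights of a second layer
data_loss = 0.30                 # assumed average loss on the training set
lam, m = 1.0, 10                 # regularization strength, number of examples

per_layer = [l2_penalty(W, lam, m) for W in (W1, W2)]
total = regularized_cost(data_loss, [W1, W2], lam, m)

print(per_layer)  # penalty contributed by each layer
print(total)      # data loss + sum of the per-layer penalties
```

Because the penalty is a plain sum over weights, adding up the per-layer penalties gives the same result as one sum over every weight in the network, so "regularizer on each layer" and "regularization term in the cost" describe the same quantity.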

Hi @James_Webb


I don’t represent TensorFlow, but if I were in the TensorFlow developers’ shoes, this design would make sense to me. Think about what the L2 regularizer actually does in implementation: whenever a layer gets updated, that layer’s weights are reduced by an additional \alpha\frac{\lambda}{m} w (using the cost function above; the exact constant depends on how the penalty is scaled). This reduction does not affect anything outside of the layer, so it is convenient to specify the regularizer within the layer.

Again, I am just offering another angle to rationalize this. I don’t mean that nobody could design a neural network library that specifies the regularization in the compile() stage, where the cost function is provided. However, its developers would need to think about how to pass the form of the regularization and the regularization parameter back to each of the layers, or how to pass the weights forward.
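
As a rough illustration of why the effect is local to a layer, here is a small sketch of one gradient descent step with the L2 term, again in plain Python and following the lecture's \frac{\lambda}{2m}\sum w^2 penalty. The function name `sgd_step` and all numbers are hypothetical, and Keras's L2 uses its own scaling, so the constant would differ there.

```python
def sgd_step(weights, data_grads, alpha, lam, m):
    """One gradient descent step for a single layer's weights.

    `data_grads` are the gradients of the unregularized loss.
    The L2 penalty (lambda / (2m)) * sum(w^2) has gradient (lambda / m) * w,
    so it subtracts an extra alpha * (lambda / m) * w from each weight.
    """
    return [
        w - alpha * g - alpha * (lam / m) * w
        for w, g in zip(weights, data_grads)
    ]


# Hypothetical values. A zero data gradient isolates the regularization effect.
w = [1.0, -2.0, 0.5]
g = [0.0, 0.0, 0.0]
alpha, lam, m = 0.1, 1.0, 10

new_w = sgd_step(w, g, alpha, lam, m)
print(new_w)  # every weight is scaled by (1 - alpha * lam / m), i.e. 0.99
```

The extra term only needs the layer's own weights, \alpha, and \lambda, which is the sense in which the regularizer can live inside the layer.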


Thank you, Raymond. Always appreciate the thoughtful replies.

You are welcome, @James_Webb!