Dropout as a more Adaptive Form of L2 Regularization

In Week 1 of Course 2, in the Regularizing Your Neural Network section, Understanding Dropout video, at time 1:45, Dr. Ng says this:

‘but it turns out that dropout can formally be shown to be an adaptive form of L2 regularization, but the L2 penalty on different weights is different, depending on the size of the activation being multiplied into that weight.’

What does L2 regularization or dropout have to do with activations? Why does the L2 penalty on different weights depend on the size of the activation being multiplied in?

All forms of regularization cause their effect by changing the cost function, either pretty directly in the L2 case or indirectly in the dropout case. And all the gradients are of J w.r.t. the various parameters, which means that all the activation functions between a given weight or bias value and J have an effect on the result. You can think of back propagation as one huge serial application of the Chain Rule, right?
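Just to make the L2 part concrete, here is a minimal numpy sketch (the function name and signature are purely illustrative, not the course's actual code) of how the regularization term gets added directly onto the cost:

```python
import numpy as np

def compute_cost_with_l2(A_out, Y, weight_matrices, lambd):
    """Cross-entropy cost plus the L2 penalty term.
    Illustrative only: A_out is the output layer activation, Y the labels,
    weight_matrices a list of the W matrices, lambd the L2 strength."""
    m = Y.shape[1]  # number of examples (one per column)
    # the usual cross-entropy part of J
    cross_entropy = -np.sum(Y * np.log(A_out) + (1 - Y) * np.log(1 - A_out)) / m
    # L2 adds a direct penalty on the squared magnitude of every weight
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weight_matrices)
    return cross_entropy + l2_penalty
```

Because that penalty term is part of J, it shows up in the dW gradient of every layer when you run back prop.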


So are you saying that the activations play a role in both L2 and dropout regularization because they form J, and because in back propagation we go back through all of those activations?
If that is what you are saying, I must point out that Dr. Ng uses ‘but’ before his statement that ‘the L2 penalty on different weights is different depending on the size of the activation being multiplied into that weight’, which means the same doesn’t happen in dropout!

Yes, the gradients always include the derivatives of the activation functions, because of the Chain Rule. The point is that L2 adds a direct penalty on the magnitudes of all the weights, so I guess I simply don’t understand the statement that you quote. Sorry.

Sorry, I don’t really understand how dropout is causing its effect by changing the cost function. Could you please clarify it? Thank you.

Sorry, I think my previous statement is maybe a little misleading. L2 does directly change the cost function by adding the regularization term, which uses the magnitude of the weights as a “penalty”. But dropout does not change the cost function itself: it just changes the individual activation outputs by randomly “killing” different neurons in the relevant layers on each iteration. That will change the cost and the gradients, but doesn’t really change the cost function itself, if you consider that to be just what happens after the output of the last layer. But note (as Prof Ng says in one of the dropout lectures) that you’re really training a different network on every iteration when you use dropout.

So technically if you consider the cost function as the complete path from the inputs to the final J value, you have changed it, but indirectly by changing the architecture of the network subtly. That was what I meant to say.
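Here is a minimal sketch of what I mean by “killing” neurons, using the inverted dropout idea on one layer's activations (again, the function name and arguments are just for illustration):

```python
import numpy as np

def dropout_forward(A, keep_prob, rng=None):
    """Apply inverted dropout to one layer's activations A.
    A fresh random mask is drawn on every call, so the effective
    architecture changes from iteration to iteration."""
    rng = np.random.default_rng() if rng is None else rng
    D = rng.random(A.shape) < keep_prob   # boolean mask of neurons to keep
    A = A * D                             # zero out ("kill") the dropped neurons
    return A / keep_prob                  # rescale so the expected activation is unchanged
```

Note that nothing after the output layer changes: if you call this twice with the same A and weights but different random masks, you get different downstream activations and hence a different J, even though the cost computation itself is untouched.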

The overall point is that the J values are not comparable between different networks. So if dropout is producing literally different networks on every training iteration, it’s not mathematically correct to plot the different J values on one graph as if they are comparable. Of course if you want to look at all this from a statistical perspective, it probably doesn’t really matter that much, as long as the dropout percentage is not too extreme. But from a purely mathematical viewpoint, it doesn’t make sense to compare J values between different networks.
