Dropout as a more Adaptive Form of L2 Regularization

In Week 1 of Course 2, in the Understanding Dropout video of the Regularizing your Neural Network section, at around 1:45, Dr. Ng says this:

‘but it turns out that dropout can formally be shown to be an adaptive form of L2 regularization, but the L2 penalty on different weights are different depending on the size of the activation being multiplied into that way.’

What does L2 regularization or dropout have to do with activations? Why does the L2 penalty on different weights depend on the size of the activation being multiplied?

All forms of regularization have their effect by changing the cost function, either directly, as in the L2 case, or indirectly, as in the dropout case. All the gradients are of J w.r.t. the various parameters, which means that every activation function between a given weight or bias value and J affects the result. You can think of back propagation as a huge serial application of the Chain Rule, right?
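
To make the Chain Rule point concrete, here is a minimal sketch for a single linear unit with a squared-error cost and an L2 penalty. The setup, the names (`cost`, `grad`, `numeric_grad`) and the numbers are all assumptions chosen for illustration, not anything from the course code. It shows that the data part of each gradient is scaled by the activation feeding that weight, while the plain L2 part is just `lambd * w_j`.

```python
# Illustrative only: one linear unit z = sum_j w_j * a_j,
# cost J = 0.5 * (z - y)^2 + 0.5 * lambd * sum_j w_j^2.

def cost(w, a, y, lambd):
    z = sum(wj * aj for wj, aj in zip(w, a))
    data_term = 0.5 * (z - y) ** 2
    l2_term = 0.5 * lambd * sum(wj ** 2 for wj in w)
    return data_term + l2_term

def grad(w, a, y, lambd):
    z = sum(wj * aj for wj, aj in zip(w, a))
    # Chain Rule: dJ/dw_j = (z - y) * a_j  (data term, scaled by the activation)
    #                       + lambd * w_j  (L2 term, independent of the activation)
    return [(z - y) * aj + lambd * wj for wj, aj in zip(w, a)]

def numeric_grad(w, a, y, lambd, eps=1e-6):
    # Central differences, used only to check the analytic gradient.
    g = []
    for j in range(len(w)):
        w_plus = list(w)
        w_plus[j] += eps
        w_minus = list(w)
        w_minus[j] -= eps
        g.append((cost(w_plus, a, y, lambd) - cost(w_minus, a, y, lambd)) / (2 * eps))
    return g

w = [0.5, -1.0, 2.0]   # hypothetical weights
a = [1.0, 0.1, 3.0]    # hypothetical input activations
y = 1.0
lambd = 0.1

analytic = grad(w, a, y, lambd)
numeric = numeric_grad(w, a, y, lambd)
print(analytic)
print(numeric)
```

The two printed lists should agree up to rounding, which is the same idea as the gradient checking exercise.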

So you are saying that the activations play a role in both the L2 and dropout regularization methods because they are part of how J is formed, and in back propagation we go backwards through all of those activations?
If that is what you mean, I must point out that Dr. Ng uses ‘but’ before his statement ‘the L2 penalty on different weights are different depending on the size of the activation being multiplied into that way’, which means the same thing doesn’t happen in dropout!

Yes, the gradients always include the derivatives of the activation functions, because of the Chain Rule. The point is that L2 adds a direct penalty on the magnitudes of all the weights, so I guess I simply don’t understand the statement that you quote. Sorry.

Sorry, I don’t really understand how dropout is causing its effect by changing the cost function. Could you please clarify it? Thank you.

Sorry, I think my previous statement is maybe a little misleading. L2 does directly change the cost function by adding the regularization term, which uses the magnitude of the weights as a “penalty”. But dropout does not change the cost function itself: it just changes the individual activation outputs by randomly “killing” different neurons in the relevant layers on each iteration. That will change the cost and the gradients, but doesn’t really change the cost function itself, if you consider that to be just what happens after the output of the last layer. But note (as Prof Ng says in one of the dropout lectures) that you’re really training a different network on every iteration when you use dropout.

So technically if you consider the cost function as the complete path from the inputs to the final J value, you have changed it, but indirectly by changing the architecture of the network subtly. That was what I meant to say.
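
The difference can be sketched in a few lines. This is only an illustration under assumptions: `l2_cost` uses the (lambd / (2 m)) scaling convention for the penalty, `inverted_dropout` is a plain-Python version of inverted dropout on one layer's activations, and all names and numbers are made up for the example. L2 adds an explicit term to J, while dropout never touches the formula for J and only changes the activations that feed into it.

```python
import random

def l2_cost(data_cost, weights, lambd, m):
    # L2: the cost function itself gets an extra term,
    # (lambd / (2 * m)) * sum of squared weights.
    penalty = (lambd / (2 * m)) * sum(w ** 2 for w in weights)
    return data_cost + penalty

def inverted_dropout(activations, keep_prob, rng):
    # Dropout: keep each unit with probability keep_prob, zero it otherwise,
    # and scale the survivors by 1 / keep_prob so the expected value is unchanged.
    mask = [1 if rng.random() < keep_prob else 0 for _ in activations]
    dropped = [act * keep / keep_prob for act, keep in zip(activations, mask)]
    return dropped, mask

rng = random.Random(0)          # fixed seed so the run is repeatable
a = [0.2, 1.5, 0.7, 3.0]        # hypothetical activations of one layer
a_drop, mask = inverted_dropout(a, keep_prob=0.8, rng=rng)
print(mask, a_drop)

# The L2 version: same data cost, plus a penalty that depends only on the weights.
J_l2 = l2_cost(data_cost=1.0, weights=[1.0, 2.0], lambd=0.5, m=10)
print(J_l2)
```

Calling `inverted_dropout` again with the same `rng` will generally give a different mask, which is the sense in which each iteration trains a slightly different network.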

The overall point is that the J values are not comparable between different networks. So if dropout is producing literally different networks on every training iteration, it’s not mathematically correct to plot the different J values on one graph as if they are comparable. Of course if you want to look at all this from a statistical perspective, it probably doesn’t really matter that much, as long as the dropout percentage is not too extreme. But from a purely mathematical viewpoint, it doesn’t make sense to compare J values between different networks.
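
Here is a small sketch of both viewpoints, under strong simplifying assumptions: a single linear unit, squared-error cost, and inverted dropout applied to its inputs. All names and numbers are hypothetical. It enumerates every possible dropout mask, shows that J differs from mask to mask, and then checks that the average of J over masks equals the no-dropout cost plus a term of the form ((1 - keep_prob) / keep_prob) * sum_j (w_j * a_j)^2. That extra term looks like an L2 penalty on each weight scaled by the square of the activation multiplied into it, which may be one way to read the quoted statement. It is exact only in this toy setting, not a claim about deep networks in general.

```python
from itertools import product

def forward(w, a, mask, keep_prob):
    # Linear unit with inverted dropout applied to its inputs.
    return sum(wj * aj * mj / keep_prob for wj, aj, mj in zip(w, a, mask))

def cost(z, y):
    return (z - y) ** 2

w = [0.5, -1.0, 2.0]   # hypothetical weights
a = [1.0, 0.1, 3.0]    # hypothetical input activations
y = 1.0
keep_prob = 0.8

costs = {}
expected_J = 0.0
for mask in product([0, 1], repeat=len(w)):
    kept = sum(mask)
    # Probability of this exact mask when units are dropped independently.
    prob = keep_prob ** kept * (1 - keep_prob) ** (len(mask) - kept)
    J = cost(forward(w, a, mask, keep_prob), y)
    costs[mask] = J
    expected_J += prob * J

# Cost of the same unit with no dropout at all.
plain_J = cost(forward(w, a, (1,) * len(w), 1.0), y)

# Activation-weighted penalty: each w_j^2 is scaled by a_j^2.
penalty = (1 - keep_prob) / keep_prob * sum((wj * aj) ** 2 for wj, aj in zip(w, a))

for mask, J in costs.items():
    print(mask, J)               # a different J for each sampled "network"
print(expected_J, plain_J + penalty)
```

The per-mask J values vary a lot, while the last line prints two numbers that should match up to rounding.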