In the video, Andrew mentioned: “dropout can formally be shown to be an adaptive form of L2 regularization, but the **L2 penalty on different weights are different depending on the size of the activation being multiplied into that weight**.”

Can you please illustrate that?

Hey @Lina_Hourieh,

Welcome to the community. I am a little confused as to what exactly you would like illustrated.

Just to be on the same page, the L2 penalty adds the sum of the squares of the weights to the cost function, scaled by the regularization parameter $\lambda$ and divided by $2m$, where $m$ is the batch size (*the 2 is just a mathematical trick to make the differentiation easier*). Minimising this cost pushes the weight values towards smaller magnitudes, which in turn reduces over-fitting.
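To make the penalty term concrete, here is a minimal sketch in plain Python. The function name `l2_regularized_cost` and the numbers passed to it are my own illustrative assumptions, not something from the course code.

```python
def l2_regularized_cost(data_cost, weights, lam, m):
    """Add the L2 penalty (lam / (2 * m)) * sum(w^2) to an unregularized cost.

    data_cost: cost before regularization (hypothetical value here)
    weights:   flat list of weight values
    lam:       regularization parameter lambda
    m:         batch size
    """
    penalty = (lam / (2 * m)) * sum(w ** 2 for w in weights)
    return data_cost + penalty


# Illustrative numbers only: sum of squares is 29, so the penalty is
# (0.1 / 20) * 29, which is added on top of the data cost of 0.5.
print(l2_regularized_cost(data_cost=0.5, weights=[2.0, 3.0, 4.0], lam=0.1, m=10))
```

Larger weights inflate the penalty quadratically, which is why minimising the total cost favours smaller weights.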

Now, with the statement you quoted, Prof Andrew is hinting at the fact that “in the presence of dropout”, the weights get spread out and most of the neurons are assigned some importance (*it may be little, but still some*). In the “absence of dropout”, by contrast, the network might rely on a few neurons more heavily by assigning larger weights to them. So, essentially, dropout is trying to break down larger weight values into smaller ones.
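If it helps to see why the network cannot lean on one neuron, here is a rough sketch of inverted dropout applied to a layer's activations. I am assuming the inverted-dropout variant with a `keep_prob` parameter; the function name `inverted_dropout`, the seed, and the activation values are hypothetical and only for illustration.

```python
import random


def inverted_dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob and scale the
    survivors by 1 / keep_prob so the expected value stays the same."""
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            out.append(a / keep_prob)  # neuron kept, rescaled
        else:
            out.append(0.0)            # neuron dropped for this step
    return out


rng = random.Random(0)  # fixed seed only so the run is repeatable
activations = [1.0, 2.0, 3.0, 4.0]
for step in range(3):
    # A different random subset of neurons is silenced on each step.
    print(inverted_dropout(activations, keep_prob=0.5, rng=rng))
```

Since any neuron can be switched off on a given step, putting all of the weight on a single neuron is risky for the network, so training tends to spread the weight across several neurons.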

For instance, in the “absence of dropout”, let’s say a single neuron is assigned a large weight of `9`. When we apply dropout, this weight is effectively redistributed into, say, `2, 3, 4`. Although 9 = 2 + 3 + 4, we have 9^2 = 81 > 2^2 + 3^2 + 4^2 = 29. So you can see how dropout redistributes the weights to decrease the sum of the squares of the weights, which is exactly what the L2 penalty does. Let me know if this helps.
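The arithmetic above can be checked with a few lines of Python. The variable names are just my own labels for the two cases in the example.

```python
def sum_of_squares(weights):
    """Return the sum of squared weights, the quantity the L2 penalty scales."""
    return sum(w ** 2 for w in weights)


concentrated = [9.0]         # one neuron carries all the weight
spread = [2.0, 3.0, 4.0]     # same total weight, spread over three neurons

print(sum(concentrated), sum_of_squares(concentrated))  # 9.0 81.0
print(sum(spread), sum_of_squares(spread))              # 9.0 29.0
```

Both cases have the same total weight, but the spread-out version has a much smaller sum of squares.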

Regards,

Elemento