Understanding Dropout

Hi Sir/Madam,

Can someone please explain this passage and illustrate it?

So this unit will be more motivated to spread out this weights and give you a little bit of weight to each of the four inputs to this unit. And by spreading out the weights this will tend to have an effect of shrinking the squared norm of the weights, and so similar to what we saw with L2 regularization. The effect of implementing dropout is that it's shrinking the weights and similar to L2 regularization, it helps to prevent overfitting.

Who is spreading out the weights? How will it shrink the squared norm of the weights?

Please give us a reference to where that quote comes from. If it is one of Prof Ng’s lectures, please give the name of the lecture and the time offset. If it’s from the web, please give us a link.

The point is that this is an effect of how dropout works: because different random neurons are “zapped” on each iteration, it modifies the way the training happens and causes the learned connections between particular outputs of the previous layer and neurons in the given layer to be weaker. And they are suggesting that weaker connections are reflected in having smaller weight values.
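To make the "zapping" concrete, here is a minimal sketch of inverted dropout applied to the outputs of a previous layer. The function name `inverted_dropout`, the activation values, and `keep_prob=0.5` are all made up for illustration and are not taken from the course code. The point to notice is that a different random subset of inputs is zeroed on each iteration, so the receiving neuron cannot count on any single input being present.

```python
import random

def inverted_dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob and
    divide the survivors by keep_prob (the "inverted" rescaling)."""
    out = []
    for a in activations:
        keep = rng.random() < keep_prob
        out.append(a / keep_prob if keep else 0.0)
    return out

rng = random.Random(0)            # fixed seed only so runs are repeatable
a_prev = [0.5, 1.2, 0.8, 0.3]     # hypothetical outputs of the previous layer

# Each "iteration" sees a different random subset of the four inputs.
for step in range(3):
    print(step, inverted_dropout(a_prev, keep_prob=0.5, rng=rng))
```

Because any of the four inputs may vanish on a given iteration, putting a large weight on just one of them is a risky strategy for the neuron, which is the intuition behind the weights getting spread out.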

Well, what is the squared norm of the weights? It is the sum of the squares of all the elements of W at a given layer of the network. So if the weight values are smaller in magnitude (see the previous paragraph), the squared norm will be smaller as well, right? If you square a number with a smaller absolute value, the result is smaller. We’ve been through this business of considering the meaning of the squared norms before, right? Remember this thread about how inverted dropout works? That was from almost exactly two years ago. A proverbial “Blast from the Past!” :nerd_face:
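Here is a small sketch of why "spreading out" the weights shrinks the squared norm, using the four-input unit from the quoted passage. The specific weight values are hypothetical, chosen so that both versions have the same total weight (2.0); only the way that total is distributed differs.

```python
def squared_norm(W):
    """Sum of the squares of every element of a weight matrix (list of rows)."""
    return sum(w * w for row in W for w in row)

# One unit with four inputs. Both rows sum to 2.0.
concentrated = [[2.0, 0.0, 0.0, 0.0]]   # all the weight on one input
spread = [[0.5, 0.5, 0.5, 0.5]]         # a little weight on each input

print(squared_norm(concentrated))  # 4.0
print(squared_norm(spread))        # 1.0
```

Squaring penalizes large individual values, so the same total weight spread evenly across four inputs gives a squared norm of 1.0 instead of 4.0. That is the sense in which dropout's effect resembles L2 regularization, which penalizes the squared norm directly.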