The dropout technique confuses me

Why would you use the dropout technique if you can just build a smaller NN?

Dropout does not mean that you are building a smaller network.

During each training iteration, a few nodes are randomly switched off, so you train a smaller part of the network for that iteration only.
As a result, each node in the neural network learns to pay attention to all of its inputs from the previous layer, not just a few of them.
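
To make the "randomly switched off" part concrete, here is a minimal sketch of inverted dropout on one layer's activations. It is only an illustration, not the assignment's code: the function name `dropout_forward`, the `keep_prob` value, and the sample activations are all made up for this example.

```python
import random

def dropout_forward(activations, keep_prob, rng, training=True):
    """Apply inverted dropout to a list of activations (hypothetical helper).

    Returns (output, mask). When training is False, nothing is dropped.
    """
    if not training:
        return list(activations), [1] * len(activations)
    # Each node is kept independently with probability keep_prob.
    mask = [1 if rng.random() < keep_prob else 0 for _ in activations]
    # Scale the surviving nodes by 1 / keep_prob so the expected value is unchanged.
    output = [a * m / keep_prob for a, m in zip(activations, mask)]
    return output, mask

rng = random.Random(0)
a = [0.5, 1.2, 0.3, 0.9, 2.0]  # made-up activations of a 5-node layer
for step in range(3):
    out, mask = dropout_forward(a, keep_prob=0.8, rng=rng)
    print(step, mask, out)
```

The layer has 5 nodes in every iteration. Only the mask changes from one iteration to the next, which is why no single node can be relied on all the time.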

Here’s an example from the input layer’s perspective:
Say you have 10 features in your input and height is an important one. Without dropout, your NN could learn to pay a lot of attention to this one feature. With dropout applied to the input layer, the height feature is sometimes made unavailable (i.e. dropped out), and the NN has to predict the target using just the other features. As a result, the NN learns over time to spread its weights across all features instead of concentrating on that one feature.


Oh, so after training, your NN has the same size as it had before dropout?

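Yes, the architecture of the NN remains unchanged. Dropout is applied only during training; when you make predictions, all nodes are active.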

Oh, okay. Thank you. Now it is clear to me!

Here’s another thread that discusses the question of why you don’t just reduce the size of the network to eliminate overfitting instead of applying dropout.

Thank you for your solution. I understand the similarity between the idea of dropout and L2 regularization, but could you please explain the difference between them? The lecture describes it in the following way:

“L2 penalty on different weights are different, depending on the size of the activations being multiplied that way.

But to summarize, it is possible to show that drop out has a similar effect to L2 regularization. Only to L2 regularization applied to different ways can be a little bit different and even more adaptive to the scale of different inputs.”
Thank you in advance.

Here’s how L2 and dropout are different.

With L2, the gradient of each weight gets an additional term $\frac{\lambda}{m} \cdot \text{weight}$. This is a fixed rule: every weight is shrunk in proportion to its own size, in the same way at every iteration.
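
As a rough sketch of that extra term: assuming the penalty added to the cost is $\frac{\lambda}{2m} \sum w^2$ (which is what produces a gradient term of $\frac{\lambda}{m} w$), the update could look like this. The function names and the numbers are hypothetical and chosen only for illustration.

```python
def l2_gradient_term(weights, lambd, m):
    """Extra gradient from the assumed L2 penalty (lambd / (2 * m)) * sum(w ** 2)."""
    return [(lambd / m) * w for w in weights]

def regularized_gradient(data_grads, weights, lambd, m):
    """Gradient of the data loss plus the L2 term, weight by weight."""
    extra = l2_gradient_term(weights, lambd, m)
    return [g + e for g, e in zip(data_grads, extra)]

weights = [2.0, -0.5, 0.0]     # made-up weights
data_grads = [0.1, 0.1, 0.1]   # made-up gradients of the unregularized loss
print(l2_gradient_term(weights, lambd=0.5, m=10))
print(regularized_gradient(data_grads, weights, lambd=0.5, m=10))
```

Note that the extra term depends only on the weight itself: a large weight is always pushed harder towards zero, and a zero weight gets no push at all, regardless of which inputs were seen.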

With dropout, that’s not the case. In each iteration, only the nodes that took part in the forward pass contribute to the weight updates; the dropped nodes receive no gradient for that iteration. This has the effect of the network learning how to adapt its weights by paying attention to more of its inputs.
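
Here is a small sketch of what happens in the backward pass, assuming inverted dropout with the same mask that was used in the forward pass. The helper `dropout_backward`, the mask, and the gradient values are invented for this example.

```python
def dropout_backward(upstream_grads, mask, keep_prob):
    """Pass gradients only through nodes that were kept in the forward pass."""
    # Dropped nodes (mask 0) get zero gradient; kept nodes are rescaled by
    # 1 / keep_prob, mirroring the scaling used in the forward pass.
    return [g * m / keep_prob for g, m in zip(upstream_grads, mask)]

mask = [1, 0, 1, 1, 0]                    # made-up mask from one forward pass
upstream = [0.2, -0.4, 0.1, 0.3, 0.5]     # made-up gradients from the next layer
grads = dropout_backward(upstream, mask, keep_prob=0.6)
for node, (m, g) in enumerate(zip(mask, grads)):
    print(node, "kept" if m else "dropped", g)
```

Unlike the L2 term, which nodes get a zero gradient here depends on the random mask, so it changes from iteration to iteration.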

It would be best for you to try the 2nd programming assignment for this week, since both L2 and dropout have to be implemented from scratch there.