Why do we want to keep the expected value of a[l] the same by dividing it by the keep-prob? What if we don’t do it?
One answer I got from previous posts is that it keeps the normalization consistent with the no-dropout case, on average.
Honestly, I didn’t understand what is meant by “normalization consistent with the no-dropout case”.
Then the final question is why do we want to keep the normalization consistent with the no-dropout case?
Hi @plutonic18
Dropout randomly sets the activations of some neurons to zero. With this regularization we induce additional “noise” into the training process, which can prevent overfitting of the neural network.
Let’s take an illustrative example and set the dropout rate very high: imagine a dropout rate of, say, 50 %. Then half of your activations would be missing, since they would be “dropped out”, i.e. set to an output of 0.
Aside from the fact that the training itself would probably not succeed, the biggest reason from my perspective is the following:
If we did not compensate for the dropped-out activations, model performance would be very problematic, because the activations would simply not be representative. If you are in doubt why, feel free to take a look at how a typical histogram (the statistical distribution of the output activations) looks, as described here.
Accordingly, the model performance would suffer a lot from a highly imbalanced distribution of activations if we did not compensate for the dropout rate with keep-prob. This should also become apparent as poor training performance.
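To make that concrete, here is a minimal numpy sketch (the array shape, seed, and keep_prob value are arbitrary assumptions for illustration): it drops half of the activations without any rescaling and shows how the mean of the layer output shrinks, i.e. the distribution is no longer representative of the no-dropout case.

```python
import numpy as np

np.random.seed(0)

# Illustrative activations of one hidden layer (shape chosen arbitrarily)
a = np.random.rand(1000, 100)          # values in [0, 1)
keep_prob = 0.5                        # i.e. a dropout rate of 50 %

# Dropout mask: each unit is kept with probability keep_prob
mask = np.random.rand(*a.shape) < keep_prob

a_dropped = a * mask                   # no compensation for the dropped units

print("mean without dropout:", a.mean())           # ~0.5
print("mean with 50% dropout:", a_dropped.mean())  # ~0.25, roughly halved
```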
Here you can find some further useful explanation:
Hope that helps!
Best regards
Christian
It just means “no dropout at all”. In that case you have (compared to our previous example) the “original” histogram of activations, e.g. with an expected value of 10 as an arbitrary example.
If you dropped out 50 % (as in our previous example), the expected value of the histogram would be, e.g., 5.
In order to keep the activation distribution consistent and representative (independent of your choice of the dropout parameter), you do the scaling with keep-prob. In our case we would divide by 50 % = 0.5 (= 1 - dropout rate), which would bring the expected value back from 5 to 10 and get a grip on the imbalance risk.
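Here is a sketch of exactly that scaling step (an inverted-dropout style compensation; the seed, shapes, and the activation value of 10 are arbitrary assumptions mirroring the example above). Dividing the kept activations by keep-prob brings the expected value back to where it was without dropout:

```python
import numpy as np

np.random.seed(1)

keep_prob = 0.5                        # keep 50 % of the units
a = np.full((1000, 100), 10.0)         # pretend every activation equals 10

mask = np.random.rand(*a.shape) < keep_prob
a_dropout = a * mask                   # expected value drops to about 5
a_scaled = a_dropout / keep_prob       # divide by keep_prob -> back to about 10

print("original mean:        ", a.mean())          # 10.0
print("after dropout:        ", a_dropout.mean())  # ~5.0
print("after dividing by 0.5:", a_scaled.mean())   # ~10.0
```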
Best regards
Christian
The key point to realize is that all forms of regularization (dropout, L2, Lasso …) only happen at training time, not at test time or in normal inference mode (prediction). So if you train without the reverse scaling, the later layers after the dropout layers will not be trained to expect the same amount of “energy” from the previous layers.
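A minimal sketch of that train/inference asymmetry (the function name, flag, and shapes are my own illustrative assumptions, not course code): inverted dropout with the keep-prob scaling is applied only while training, so the next layer receives roughly the same expected input in both modes.

```python
import numpy as np

def forward_with_dropout(a_prev, W, b, keep_prob=0.8, training=True):
    """One layer's forward pass; inverted dropout is applied only while training."""
    z = W @ a_prev + b
    a = np.maximum(0, z)                              # ReLU activation
    if training:
        mask = np.random.rand(*a.shape) < keep_prob   # drop units at random
        a = (a * mask) / keep_prob                    # rescale so E[a] is unchanged
    # At test/inference time nothing is dropped and nothing is rescaled,
    # so the next layer receives the same "energy" it was trained to expect.
    return a

# Illustrative usage with arbitrary shapes
np.random.seed(2)
W = np.random.randn(5, 3)
b = np.zeros((5, 1))
a_prev = np.random.rand(3, 200)

a_train = forward_with_dropout(a_prev, W, b, training=True)
a_test  = forward_with_dropout(a_prev, W, b, training=False)
print("train-time mean:", a_train.mean())
print("test-time mean: ", a_test.mean())
```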
This question has been asked a number of times before. Here’s a good thread to read, which also points to this one.