Why do we want to keep the expected value of a[l] the same by dividing it by the keep-prob? What if we don’t do it?
One answer I got from previous posts is that it keeps the normalization consistent with the no-dropout case, on average.
Honestly, I didn’t understand what is meant by “normalization consistent with the no-dropout case”.
Then the final question is why do we want to keep the normalization consistent with the no-dropout case?
Hi @plutonic18
Dropout randomly sets the activations of some neurons to zero. With this regularization we induce additional “noise” into the training process, which can prevent overfitting of the neural network.
Let’s take an illustrative example and set the dropout rate very high: imagine a dropout rate of, say, 50 %. Then half of your activations would be missing, since they would be “dropped out”, i.e. set to an output of 0.
Aside from the fact that the training itself would probably not succeed, the biggest reason from my perspective is the following:
If we did not compensate for the dropped-out activations, model performance would be very problematic, because the activations would simply not be representative. If you are in doubt why, feel free to take a look at how a typical histogram (the statistical distribution of the output activations) looks, as described here.
Accordingly, the model performance would suffer a lot from a highly imbalanced distribution of activations if we did not compensate for the dropout rate with keep-prob. This should also become apparent as poor training performance.
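To make that concrete, here is a minimal numpy sketch (the array shape, seed, and keep_prob value are arbitrary assumptions for illustration): it drops half of the activations without any rescaling and shows how the mean of the layer output shrinks, i.e. the distribution is no longer representative of the no-dropout case.

```python
import numpy as np

np.random.seed(0)

# Illustrative activations of one hidden layer (shape chosen arbitrarily)
a = np.random.rand(1000, 100)          # values in [0, 1)
keep_prob = 0.5                        # i.e. a dropout rate of 50 %

# Dropout mask: each unit is kept with probability keep_prob
mask = np.random.rand(*a.shape) < keep_prob

a_dropped = a * mask                   # no compensation for the dropped units

print("mean without dropout:", a.mean())           # ~0.5
print("mean with 50% dropout:", a_dropped.mean())  # ~0.25, roughly halved
```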
Here you can find some further useful explanation:
Hope that helps!
Best regards
Christian
It just means “no dropout at all”. In that case you have (compared to our previous example) the “original” histogram of activations, e.g. with an expected value of 10 as an arbitrary example.
If you dropped out 50 % (as in our previous example), the expected value of the histogram would be, e.g., 5.
In order to keep the activation distribution consistent and representative (independent of your choice of the dropout parameter), you do the scaling with keep-prob. In our case we would divide by 50 % = 0.5 (= 1 - dropout rate), which would bring the expected value back from 5 to 10 and get a grip on the imbalance risk.
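Here is a sketch of exactly that scaling step (an inverted-dropout style compensation; the seed, shapes, and the activation value of 10 are arbitrary assumptions mirroring the example above). Dividing the kept activations by keep-prob brings the expected value back to where it was without dropout:

```python
import numpy as np

np.random.seed(1)

keep_prob = 0.5                        # keep 50 % of the units
a = np.full((1000, 100), 10.0)         # pretend every activation equals 10

mask = np.random.rand(*a.shape) < keep_prob
a_dropout = a * mask                   # expected value drops to about 5
a_scaled = a_dropout / keep_prob       # divide by keep_prob -> back to about 10

print("original mean:        ", a.mean())          # 10.0
print("after dropout:        ", a_dropout.mean())  # ~5.0
print("after dividing by 0.5:", a_scaled.mean())   # ~10.0
```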
Best regards
Christian
The key point to realize is that all forms of regularization (dropout, L2, Lasso …) only happen at training time, not at test time or in normal inference mode (prediction). So if you train without the reverse scaling, the later layers after the dropout layers will not be trained to expect the same amount of “energy” from the previous layers.
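A minimal sketch of that train/inference asymmetry (the function name, flag, and shapes are my own illustrative assumptions, not course code): inverted dropout with the keep-prob scaling is applied only while training, so the next layer receives roughly the same expected input in both modes.

```python
import numpy as np

def forward_with_dropout(a_prev, W, b, keep_prob=0.8, training=True):
    """One layer's forward pass; inverted dropout is applied only while training."""
    z = W @ a_prev + b
    a = np.maximum(0, z)                              # ReLU activation
    if training:
        mask = np.random.rand(*a.shape) < keep_prob   # drop units at random
        a = (a * mask) / keep_prob                    # rescale so E[a] is unchanged
    # At test/inference time nothing is dropped and nothing is rescaled,
    # so the next layer receives the same "energy" it was trained to expect.
    return a

# Illustrative usage with arbitrary shapes
np.random.seed(2)
W = np.random.randn(5, 3)
b = np.zeros((5, 1))
a_prev = np.random.rand(3, 200)

a_train = forward_with_dropout(a_prev, W, b, training=True)
a_test  = forward_with_dropout(a_prev, W, b, training=False)
print("train-time mean:", a_train.mean())
print("test-time mean: ", a_test.mean())
```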
This question has been asked a number of times before. Here’s a good thread to read, which also points to this one.