In the “Dropout regularisation” video, at around 6:50, we see the idea of modifying the linear combination defining Z[l] to compensate for the absent (dropped-out) units by dividing by keep_prob.
I understand that the number of non-deactivated terms in the sum will be keep_prob * N on average, so the factor of 1/keep_prob rescales the matrix product W[l] · A[l-1] in Z[l] so that its expected value matches the no-dropout case.
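In code, my understanding of the inverted-dropout step is something like this minimal sketch (names and shapes are mine, not the course's exact code):

```python
import numpy as np

# Minimal sketch of inverted dropout for one layer, as I understand it.
np.random.seed(1)
keep_prob = 0.8
A_prev = np.random.randn(4, 5)            # activations of layer l-1: 4 units, 5 examples
W = np.random.randn(3, 4)                 # W[l]
b = np.zeros((3, 1))                      # b[l]

D = np.random.rand(*A_prev.shape) < keep_prob   # mask: keep each unit with prob. keep_prob
A_dropped = (A_prev * D) / keep_prob            # the division this thread is about
Z = W @ A_dropped + b                           # Z[l], same expected value as without dropout
```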
My question is: since we know the exact proportion of kept (active) neurons in the sum, not just the expected proportion over many repetitions, why don't we divide by that fraction instead? Then we wouldn't just compensate for the dropped-out neurons on average; we'd compensate exactly, every time.
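Concretely, the variant I'm asking about would look something like this (again just a sketch):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8
A_prev = np.random.randn(4, 5)
W = np.random.randn(3, 4)
b = np.zeros((3, 1))

D = np.random.rand(*A_prev.shape) < keep_prob
kept_frac = D.mean(axis=0, keepdims=True)   # exact fraction of kept units, per example
A_exact = (A_prev * D) / kept_frac          # divides by zero if an example loses every unit!
Z_exact = W @ A_exact + b
```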
Thanks!
PS If there’s a way to use LaTeX, please tell me and I’ll update my equations.
I agree with everything you write there, but it still leaves me wondering about the relative merits of the two alternatives.
Having thought about it a bit more: as coded in the course, the number of kept neurons follows a binomial distribution, so there's a non-zero probability of deactivating all of them at once. That probability is (1 - keep_prob)^n, which matters most for small layers or low keep_prob values: with n = 4 and keep_prob = 0.8 it's 0.2^4 = 1/625, so very possible over many iterations or a lot of data.
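A quick check of that figure:

```python
# Probability that all n units of a layer are dropped in one pass.
n, keep_prob = 4, 0.8
p = (1 - keep_prob) ** n
print(p, 1 / p)   # 0.0016 -> about once every 625 passes
```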
If instead of keep_prob we divided by the actual fraction of kept neurons, we'd need to forbid the all-deactivated case to prevent division by zero. So I can see why we don't just make that change to the current code: it would mostly run fine, but every so often it would throw an error.
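For example, the mask draw would need a guard along these lines (purely hypothetical, not a suggestion that the course code does this):

```python
import numpy as np

def draw_mask(shape, keep_prob):
    """Redraw the dropout mask until every example keeps at least one unit."""
    D = np.random.rand(*shape) < keep_prob
    while not D.any(axis=0).all():          # some column all-dropped -> would divide by zero
        D = np.random.rand(*shape) < keep_prob
    return D

D = draw_mask((4, 5), keep_prob=0.8)
```

One side effect: conditioning on at least one kept unit nudges the expected kept fraction slightly above keep_prob, which is another small argument for dividing by the realised fraction rather than keep_prob in that variant.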
Which raises the question: why do we allow the all-deactivated case anyway? Isn’t it pure noise, even if relatively rare?
Do we implicitly assume, with a lot of this, that gradient descent is robust enough, and the underlying function simple enough, that it can basically recover gracefully from the occasional disruption?
I had the same question and concerns, and would be interested in any further discussion of this topic. It didn't seem like using the exact numbers would be computationally significant, and in the video Andrew says the whole reason to divide by keep_prob is to ensure that the z estimate is scaled appropriately and that the expected value of a is unchanged.
Using keep_prob instead of the actual number of neurons kept seems like it would needlessly add noise to your estimates and, as pointed out, could have a larger impact on smaller hidden layers.
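To make that concrete, here's a toy Monte Carlo of the scale noise, assuming constant activations of 1 so that the kept sum is just the kept count (my own sketch, nothing from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
n, keep_prob, trials = 4, 0.8, 100_000

# Number of kept units per forward pass (binomial, as discussed above).
kept = (rng.random((trials, n)) < keep_prob).sum(axis=1)

z_course = kept / keep_prob                      # divide by keep_prob: right on average, noisy
nonzero = kept > 0
z_exact = kept[nonzero] / (kept[nonzero] / n)    # divide by realised fraction: always exactly n

print(z_course.mean(), z_course.std())           # mean ~ 4.0, nonzero std
print(z_exact.mean(), z_exact.std())             # 4.0 exactly, std 0.0
```

With non-constant activations the exact-fraction variant would of course still be noisy from *which* units survive; it only removes the extra noise in the overall scale.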