[C2W1 - Regularization] A question about inverted dropout scaling factor

(this was previously posted on the forum, then moved here)
When using inverted dropout, we follow these steps (a short code sketch follows the list):

1. Compute the dropout matrix D

D = matrix of the same shape as A,
        containing zeros and ones (each entry is 1 with probability keep_prob)

2. Update the activation matrix by zeroing out some elements

A = A * D   (keep A where D is 1, set it to 0 elsewhere)

3. Rescale the activation matrix

A = A / keep_prob
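
Putting the three steps together, here is a minimal NumPy sketch of inverted dropout for one layer (the function name and the use of numpy.random.default_rng are my own choices for illustration, not taken from the assignment):

import numpy as np

def inverted_dropout(A, keep_prob, rng=None):
    # Minimal sketch: apply inverted dropout to an activation matrix A
    rng = np.random.default_rng() if rng is None else rng
    # 1. Dropout mask D: same shape as A, each entry is 1 with probability keep_prob
    D = (rng.random(A.shape) < keep_prob).astype(A.dtype)
    # 2. Zero out the dropped activations
    A = A * D
    # 3. Rescale so the expected value of each activation is unchanged
    A = A / keep_prob
    return A, D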

My question: would it not be more accurate to rescale by the actual “activation reduction factor” for this layer? I.e. if we dropped 3 out of 12 activations, we should scale by a factor of 12/9.

For example, this could be written as

A = A * A.size / np.sum(D)
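
In terms of the sketch above, this proposal would replace step 3 inside the function with something like the following (my own illustration; the guard against an all-zero mask is an extra assumption to avoid dividing by zero):

    # 3'. Alternative: rescale by the actual fraction of units kept this pass
    kept = np.sum(D)
    if kept > 0:  # avoid division by zero in the (unlikely) all-dropped case
        A = A * A.size / kept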

I tried it in the Week 1 Regularization assignment, and my results are similar (i.e. 92.45% on the training set, and 95% on the test set).


However, I suspect that this method might be a little more precise, especially when keep_prob is closer to 1:

Below is the output when I change keep_prob to 0.9 with the original factor (1/keep_prob)

And below is the output when I use the proposed factor (A.size / np.sum(D))


Hi @pthom,

That’s interesting; I remember wondering about this myself when I first encountered it.

What I think right now is that at any given iteration the actual fraction of neurons kept can vary slightly from the keep_prob factor; however, the stochastic nature of the technique compensates for this as you iterate more, so in expectation things balance out.
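
That intuition can be checked with a quick simulation (my own illustration, not from the course): the realized fraction of kept units fluctuates from iteration to iteration, but its average converges to keep_prob, so dividing by keep_prob is correct in expectation.

import numpy as np

rng = np.random.default_rng(0)
keep_prob, n_units, n_iters = 0.9, 12, 10000

# Fraction of units actually kept at each simulated iteration
fractions = np.array([(rng.random(n_units) < keep_prob).mean() for _ in range(n_iters)])

print(fractions.mean())  # close to 0.9, so 1/keep_prob is the right factor on average
print(fractions.std())   # per-iteration spread around keep_prob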

Another reason could be that the difference in results is not significant whichever factor you use, and it is cheaper to perform the adjustment with the keep_prob value than with the actual fraction of turned-on neurons.

Also note that Dropout was originally proposed with scaling down of the weights at test time (multiplying by keep_prob), which was already an approximation, since at test time you usually don’t know the exact number of neurons that were kept on during training.
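
For comparison, here is a rough sketch of that original (non-inverted) formulation, with no rescaling at training time and a multiplication by keep_prob at test time (I scale the activations here, which has the same effect on the next layer as scaling the outgoing weights; the function names are mine):

import numpy as np

rng = np.random.default_rng()

def dropout_train(A, keep_prob):
    # Original dropout at training time: drop units, no rescaling
    D = (rng.random(A.shape) < keep_prob).astype(A.dtype)
    return A * D

def dropout_test(A, keep_prob):
    # At test time, scale down by keep_prob to match the training-time expectation
    return A * keep_prob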

The original paper (Section 10, second paragraph) mentions explicitly the use of 1/keep_prob (1/p in the paper) as scaling factor at training time.

Hope that helps.


I had the same thought while watching the video.

So, the takeaway is (if I’m reading this correctly) that it is “more correct” to use the ACTUAL proportion kept, but because the expected difference in results is assumed to be insignificant and using keep_prob directly is computationally cheaper, we just use the keep_prob factor instead?

Hello @mjhapp,

Thanks for digging up this very interesting conversation! :wink: I think your takeaway has captured the essence, and I agree with it.

Cheers,
Raymond