[C2W1 - Regularization] A question about inverted dropout scaling factor

(this was previously posted on the forum, then moved here)
When using inverted dropout, we follow these steps (a short code sketch follows the list):

1. Compute the dropout matrix D

D = matrix of the same shape as A,
        containing zeros and ones (each entry is 1 with probability keep_prob)

2. Update the activation matrix by zeroing out some elements

A = A * D   (keep A where D is 1, set it to 0 elsewhere)

3. Rescale the activation matrix

A = A / keep_prob
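
Putting the three steps together, here is a minimal NumPy sketch of inverted dropout for one layer (the function name and the use of numpy.random.default_rng are my own choices for illustration, not taken from the assignment):

import numpy as np

def inverted_dropout(A, keep_prob, rng=None):
    # Minimal sketch: apply inverted dropout to an activation matrix A
    rng = np.random.default_rng() if rng is None else rng
    # 1. Dropout mask D: same shape as A, each entry is 1 with probability keep_prob
    D = (rng.random(A.shape) < keep_prob).astype(A.dtype)
    # 2. Zero out the dropped activations
    A = A * D
    # 3. Rescale so the expected value of each activation is unchanged
    A = A / keep_prob
    return A, D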

My question: would it not be more accurate to rescale by the actual “activation reduction factor” for this layer? I.e. if we dropped 3 out of 12 activations, we should scale by a factor of 12/9.

For example, this could be written as

A = A * A.size / np.sum(D)
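
In terms of the sketch above, this proposal would replace step 3 inside the function with something like the following (my own illustration; the guard against an all-zero mask is an extra assumption to avoid dividing by zero):

    # 3'. Alternative: rescale by the actual fraction of units kept this pass
    kept = np.sum(D)
    if kept > 0:  # avoid division by zero in the (unlikely) all-dropped case
        A = A * A.size / kept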

I tried it in the Week 1 Regularization assignment, and my results are similar (i.e. 92.45% on the training set, and 95% on the test set).


However, I suspect that this method might be a little more precise, especially when keep_prob is closer to 1:

Below is the output when I change keep_prob to 0.9 with the original factor (1/keep_prob)

And below is the output when I use the proposed factor (A.size / np.sum(D))


Hi @pthom,

That’s interesting; I remember wondering about this myself when I first encountered it.

What I think right now is that at any given iteration the actual fraction of neurons kept can vary slightly from the keep_prob factor; however, the stochastic nature of the technique compensates for this as you iterate more, so in expectation things balance out.
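
That intuition can be checked with a quick simulation (my own illustration, not from the course): the realized fraction of kept units fluctuates from iteration to iteration, but its average converges to keep_prob, so dividing by keep_prob is correct in expectation.

import numpy as np

rng = np.random.default_rng(0)
keep_prob, n_units, n_iters = 0.9, 12, 10000

# Fraction of units actually kept at each simulated iteration
fractions = np.array([(rng.random(n_units) < keep_prob).mean() for _ in range(n_iters)])

print(fractions.mean())  # close to 0.9, so 1/keep_prob is the right factor on average
print(fractions.std())   # per-iteration spread around keep_prob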

Another reason could be that the difference in results is not significant whichever factor you use, and it is cheaper to perform the adjustment with the keep_prob value than with the actual fraction of turned-on neurons.

Also note that Dropout was originally proposed with scaling down of the weights at test time (multiplying by keep_prob), which was already an approximation, since at test time you usually don’t know the exact number of neurons that were kept on during training.
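
For comparison, here is a rough sketch of that original (non-inverted) formulation, with no rescaling at training time and a multiplication by keep_prob at test time (I scale the activations here, which has the same effect on the next layer as scaling the outgoing weights; the function names are mine):

import numpy as np

rng = np.random.default_rng()

def dropout_train(A, keep_prob):
    # Original dropout at training time: drop units, no rescaling
    D = (rng.random(A.shape) < keep_prob).astype(A.dtype)
    return A * D

def dropout_test(A, keep_prob):
    # At test time, scale down by keep_prob to match the training-time expectation
    return A * keep_prob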

The original paper (Section 10, second paragraph) mentions explicitly the use of 1/keep_prob (1/p in the paper) as scaling factor at training time.

Hope that helps.


I had the same thought while watching the video.

So, the takeaway is (if I’m reading this correctly) that it is “more correct” to use the ACTUAL proportion kept, but because the expected difference in results is assumed to be insignificant and using keep_prob directly is computationally cheaper, we just use the keep_prob factor instead?

Hello @mjhapp,

Thanks for digging up this very interesting conversation! :wink: I think your takeaway has captured the essence, and I agree with it.

Cheers,
Raymond