(this was previously posted on the forum, then moved here)
When using inverted dropout, we follow these steps:
1. Compute the dropout matrix D
D = matrix of the same shape as A,
with zeros and ones, where each entry is 1 with probability keep_prob
2. Update the activation matrix by zeroing the dropped elements
A = A * D (element-wise, so entries where D is 0 are zeroed out)
3. Rescale the activation matrix
A = A / keep_prob
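For reference, here is a minimal numpy sketch of these three steps (the shape (4, 5), the seed, and keep_prob = 0.8 are just example values, not taken from the assignment):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8

A = np.random.randn(4, 5)  # example activations of one hidden layer

# Step 1: dropout matrix D, same shape as A, entries are 1 with probability keep_prob
D = (np.random.rand(*A.shape) < keep_prob).astype(A.dtype)

# Step 2: zero out the dropped activations (element-wise product)
A = A * D

# Step 3: inverted-dropout rescaling by the fixed factor 1/keep_prob
A = A / keep_prob
```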
My question: would it not be more accurate to divide by the actual "activation reduction factor" for this layer? I.e., if we dropped 3 out of 12 activations, we should scale by a factor of 12/9 rather than 1/keep_prob.
For example, in code this would be written as
A = A * A.size / np.sum(D)
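As a self-contained sketch of the proposed change (again with made-up example values), the only difference from the steps above is the last line, which divides by the realized keep fraction of this particular mask instead of the expected one; this implicitly assumes np.sum(D) is not zero:

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.9

A = np.random.randn(12, 1)  # example: 12 activations in a layer
D = (np.random.rand(*A.shape) < keep_prob).astype(A.dtype)
A = A * D

# Rescale by the realized keep fraction np.sum(D) / A.size of this mask,
# rather than the expected fraction keep_prob
A = A * A.size / np.sum(D)
```

The two factors coincide only when the realized number of kept units happens to equal keep_prob * A.size (e.g. 9 out of 12 kept with keep_prob = 0.75, giving 12/9 = 1/0.75).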
I tried it in the Week 1 Regularization assignment, and my results are similar (i.e., 92.45% on the training set and 95% on the test set).
However, I suspect that this method might be a little more precise, especially when keep_prob is closer to 1:
Below is the output when I change keep_prob to 0.9 with the original factor (1/keep_prob):
And below is the output when I use the proposed factor (A.size / np.sum(D)):