In the “Dropout regularisation” video, at around 6:50, we see the idea of modifying the linear combination defining Z[l] to compensate for the absent (dropped-out) units by dividing by keep_prob.
I understand that the number of non-deactivated terms in the sum will be keep_prob * N on average, so the factor of 1/keep_prob rescales the matrix product W[l] · A[l-1] in Z[l] so that its expected value matches the no-dropout case.
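In code, my understanding of the inverted-dropout step is something like this minimal sketch (names and shapes are mine, not the course's exact code):

```python
import numpy as np

# Minimal sketch of inverted dropout for one layer, as I understand it.
np.random.seed(1)
keep_prob = 0.8
A_prev = np.random.randn(4, 5)            # activations of layer l-1: 4 units, 5 examples
W = np.random.randn(3, 4)                 # W[l]
b = np.zeros((3, 1))                      # b[l]

D = np.random.rand(*A_prev.shape) < keep_prob   # mask: keep each unit with prob. keep_prob
A_dropped = (A_prev * D) / keep_prob            # the division this thread is about
Z = W @ A_dropped + b                           # Z[l], same expected value as without dropout
```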
My question is: since we know the exact proportion of kept (active) neurons in the sum, not just the expected proportion over many repetitions, why don't we divide by that fraction instead? Then we wouldn't just compensate for the dropped-out neurons on average; we'd compensate exactly, every time.
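Concretely, the variant I'm asking about would look something like this (again just a sketch):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8
A_prev = np.random.randn(4, 5)
W = np.random.randn(3, 4)
b = np.zeros((3, 1))

D = np.random.rand(*A_prev.shape) < keep_prob
kept_frac = D.mean(axis=0, keepdims=True)   # exact fraction of kept units, per example
A_exact = (A_prev * D) / kept_frac          # divides by zero if an example loses every unit!
Z_exact = W @ A_exact + b
```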
Thanks!
PS If there’s a way to use LaTeX, please tell me and I’ll update my equations.
I agree with everything you write there, but it still leaves me wondering about the relative merits of the two alternatives.
Having thought about it a bit more: as coded in the course, the number of kept neurons follows a binomial distribution, so there's a non-zero probability of deactivating all of them at once. That probability is (1 - keep_prob)^n, which matters most for small layers or low keep_prob values: with n = 4 and keep_prob = 0.8 it's 0.2^4 = 1/625, so very possible over many iterations or a lot of data.
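A quick check of that figure:

```python
# Probability that all n units of a layer are dropped in one pass.
n, keep_prob = 4, 0.8
p = (1 - keep_prob) ** n
print(p, 1 / p)   # 0.0016 -> about once every 625 passes
```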
If instead of keep_prob we divided by the actual fraction of kept neurons, we'd need to forbid the all-deactivated case to prevent division by zero. So I can see why we don't just make that change to the current code: it would mostly run fine, but every so often it would throw an error.
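For example, the mask draw would need a guard along these lines (purely hypothetical, not a suggestion that the course code does this):

```python
import numpy as np

def draw_mask(shape, keep_prob):
    """Redraw the dropout mask until every example keeps at least one unit."""
    D = np.random.rand(*shape) < keep_prob
    while not D.any(axis=0).all():          # some column all-dropped -> would divide by zero
        D = np.random.rand(*shape) < keep_prob
    return D

D = draw_mask((4, 5), keep_prob=0.8)
```

One side effect: conditioning on at least one kept unit nudges the expected kept fraction slightly above keep_prob, which is another small argument for dividing by the realised fraction rather than keep_prob in that variant.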
Which raises the question: why do we allow the all-deactivated case anyway? Isn’t it pure noise, even if relatively rare?
Do we implicitly assume, with a lot of this, that gradient descent is robust enough, and the underlying function simple enough, that it can basically recover gracefully from the occasional disruption?
I had the same question and concerns, and would be interested in any further discussion of this topic. It didn't seem like using the exact numbers would be computationally significant, and in the video Andrew says the whole reason to divide by keep_prob is to ensure that the z estimate is scaled appropriately and that the expected value of a is unchanged.
Using keep_prob instead of the actual number of neurons kept seems like it would needlessly add noise to your estimates and, as pointed out, could have a larger impact on smaller hidden layers.
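To make that concrete, here's a toy Monte Carlo of the scale noise, assuming constant activations of 1 so that the kept sum is just the kept count (my own sketch, nothing from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
n, keep_prob, trials = 4, 0.8, 100_000

# Number of kept units per forward pass (binomial, as discussed above).
kept = (rng.random((trials, n)) < keep_prob).sum(axis=1)

z_course = kept / keep_prob                      # divide by keep_prob: right on average, noisy
nonzero = kept > 0
z_exact = kept[nonzero] / (kept[nonzero] / n)    # divide by realised fraction: always exactly n

print(z_course.mean(), z_course.std())           # mean ~ 4.0, nonzero std
print(z_exact.mean(), z_exact.std())             # 4.0 exactly, std 0.0
```

With non-constant activations the exact-fraction variant would of course still be noisy from *which* units survive; it only removes the extra noise in the overall scale.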