I had a query about the scaling up of the activation matrix.
When you keep keep_prob close to 1, say 0.9 or 0.8, the probability of dropping a node is close to 0 (0.1 or 0.2). So even in a fairly large layer (say 100 units), not many units are being dropped.
But when you scale up the activation matrix by dividing it by keep_prob, you increase all of its non-zeroed values, which increases its norm (L2 and Frobenius) substantially.
So my question is: even though the norm and the values of the matrix increase, would training still reduce the cost function in practice after every iteration? Can the weight matrices still be optimized effectively despite this?
[If I've got some concepts wrong, or if you think I'll be able to understand this better after doing the assignment, please enlighten me about that]
Well, remember that it’s the same keep_prob value that you’re using for both the downscaling (by dropping nodes) and the upscaling by multiplying by 1/keep_prob, right? So if it’s close to 1, then so is its multiplicative inverse.
1/0.9 = 1.1111…
So you’re not “substantially increasing” the norm of the activation matrix. You’re increasing it in a manner that is commensurate with the amount you decreased it by doing the dropout. Of course the dropout is both statistical and quantized, so the compensation may not be exact on any given iteration, but the point is that it is commensurate and on a statistical basis will be as close as you can get. Everything here is playing out over hundreds or thousands of iterations, so it’s all statistical behavior in any case.
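To make "commensurate" concrete, here is a minimal sketch of inverted dropout in NumPy. The function name `inverted_dropout`, the layer shape (100 units, 64 samples) and the random activations are assumptions made up for illustration, not the assignment's code. In expectation the mean of the activations is preserved, and the Frobenius norm grows only by a factor of about 1/sqrt(keep_prob), which is roughly 1.05 for keep_prob = 0.9.

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    """Zero each entry with probability 1 - keep_prob, then rescale by 1/keep_prob."""
    mask = rng.random(A.shape) < keep_prob   # True where the entry is kept
    return (A * mask) / keep_prob            # upscale the surviving entries

rng = np.random.default_rng(0)

# Hypothetical non-negative (ReLU-like) activations: 100 units, 64 samples.
A = np.abs(rng.standard_normal((100, 64)))
keep_prob = 0.9

A_drop = inverted_dropout(A, keep_prob, rng)

# Ratio of mean activation after vs. before dropout: close to 1.
mean_ratio = A_drop.mean() / A.mean()

# Ratio of Frobenius norms: close to 1/sqrt(keep_prob), not 1/keep_prob.
norm_ratio = np.linalg.norm(A_drop) / np.linalg.norm(A)

print(f"mean ratio: {mean_ratio:.3f}")
print(f"norm ratio: {norm_ratio:.3f}  (1/sqrt(keep_prob) = {1 / np.sqrt(keep_prob):.3f})")
```

The exact numbers change with the random seed, which is the "statistical" part: on any single iteration the compensation is only approximate.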
Here’s a thread that discusses this more and actually shows some examples of the effect of scaling on the 2-norm.
Here’s another interesting thread about dropout that discusses another subtle point: whether the dropout is the same across all samples in the batch.
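For reference, the two possibilities can be sketched like this. The shapes and variable names below are hypothetical, and this only illustrates the distinction; it does not restate what that thread concludes.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, m = 5, 4          # hypothetical layer size and batch size
keep_prob = 0.8

# Option 1: an independent mask entry per unit AND per sample,
# so each column (sample) drops a different set of units.
per_sample_mask = rng.random((n_units, m)) < keep_prob

# Option 2: one mask per unit, broadcast across the batch,
# so every sample in the batch drops the same units.
shared_mask = np.broadcast_to(rng.random((n_units, 1)) < keep_prob, (n_units, m))

print(per_sample_mask.astype(int))
print(shared_mask.astype(int))
```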