Well, remember that it’s the same *keep_prob* value that you’re using for both the downscaling (by dropping nodes) and the upscaling by multiplying by *1/keep_prob*, right? So if it’s close to 1, then so is its multiplicative inverse.

1/0.9 = 1.1111…

So you’re not “substantially increasing” the norm of the activation matrix. You’re increasing it in a manner that is commensurate with the amount you decreased it by doing the dropout. Of course the dropout is both statistical and quantized (each node is either kept or dropped entirely), so the compensation may not be exact on any given iteration, but the point is that it is commensurate: on a statistical basis, the expected value of each activation is unchanged. Everything here is playing out over hundreds or thousands of iterations, so it’s all statistical behavior in any case.
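If you want to see this numerically, here is a minimal NumPy sketch of inverted dropout. The helper name `inverted_dropout`, the made-up ReLU-like activation matrix, and the choice to draw an independent mask entry for every element are all assumptions for illustration, not code from the course. It prints the mean and the 2-norm before dropout, after dropping, and after rescaling by *1/keep_prob*.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Hypothetical helper: returns (activations after dropping, after rescaling)."""
    # Assumption: the mask has the same shape as `a`, so every entry is
    # kept or dropped independently with probability keep_prob.
    mask = rng.random(a.shape) < keep_prob
    dropped = a * mask            # downscale: zero out roughly (1 - keep_prob) of the entries
    scaled = dropped / keep_prob  # upscale: multiply by 1/keep_prob
    return dropped, scaled

rng = np.random.default_rng(0)
keep_prob = 0.9

# Made-up non-negative activations, roughly what a ReLU layer might output.
a = np.maximum(rng.standard_normal((50, 1000)), 0.0)

dropped, scaled = inverted_dropout(a, keep_prob, rng)

print("mean   before / dropped / rescaled:",
      a.mean(), dropped.mean(), scaled.mean())
print("2-norm before / dropped / rescaled:",
      np.linalg.norm(a), np.linalg.norm(dropped), np.linalg.norm(scaled))
```

With *keep_prob* = 0.9 the rescaled mean should land very close to the original mean, and the rescaled 2-norm should land within several percent of the original rather than far above it. The match on the norm is not exact, since the rescaling restores the expected value of each activation rather than the norm itself.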

Here’s a thread that discusses this more and actually shows some examples of the effect of scaling on the 2-norm.

Here’s another interesting thread about dropout that discusses another subtle point: whether the dropout is the same across all samples in the batch.