Inverted Dropout

The point is that dropout randomly weakens the connections between specific neurons and the next layer: the output of any individual neuron becomes a bit more stochastic and less predictable, because it might get zapped on any given iteration. That's what causes the regularization effect. Compensating for the aggregate “energy” by multiplying by 1/keep_prob just keeps the overall level of the output the same, even though it comes from a different subset of neurons each time.
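To make that concrete, here's a minimal sketch of an inverted dropout forward step in NumPy. The function name and arguments are just illustrative, not from the assignment; the idea is only that we zero units with probability 1 - keep_prob and then divide by keep_prob so the expected activation is unchanged:

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Apply inverted dropout to an activation array a."""
    # Each unit survives independently with probability keep_prob.
    mask = (rng.random(a.shape) < keep_prob).astype(a.dtype)
    # Zero out dropped units, then scale up by 1/keep_prob so the
    # expected total "energy" of the layer's output stays the same.
    return (a * mask) / keep_prob

rng = np.random.default_rng(0)
a = np.ones((4, 5))
a_drop = inverted_dropout(a, keep_prob=0.8, rng=rng)
# Surviving entries are scaled to 1/0.8 = 1.25, dropped ones are 0,
# so the expected value of each entry is still 1.
```

Because of the 1/keep_prob scaling, you don't have to do anything special at test time: you just run the network with no mask and the activations are already at the right scale.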

Everything I’m saying here is just me repeating what Prof Ng says in the lecture, maybe with slightly different wording. Since what I’m saying doesn’t seem to be helping, you might want to go back and watch the lectures on Dropout again. He does a way better job of explaining all this talking at the whiteboard than I can by just typing words. Of course he’s also a way better teacher than I can ever hope to be in any case.

There are two ways to implement dropout: you can make the mask the same for every sample in a given batch, or you can draw a fresh mask for every sample on every iteration. Prof Ng has us build it the latter way, and I think that’s what the original paper describes as well, though it’s a bit ambiguous on that point.
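The difference between the two is just the shape of the random mask. A rough sketch, assuming the course's (units, examples) layout for the activation matrix; the variable names here are mine, not the assignment's:

```python
import numpy as np

rng = np.random.default_rng(1)
keep_prob = 0.8
A = np.ones((3, 5))  # (units, examples) layout, as in the course

# Option 1: one mask per unit, shared across the whole batch.
# Shape (units, 1) broadcasts across the example columns, so every
# sample in the batch has the same units dropped.
shared_mask = (rng.random((A.shape[0], 1)) < keep_prob)
A_shared = A * shared_mask / keep_prob

# Option 2: an independent mask for every unit of every sample,
# which is the way the assignment has us build it.
full_mask = (rng.random(A.shape) < keep_prob)
A_per_sample = A * full_mask / keep_prob
```

With the shared mask, every column of `A_shared` is identical; with the per-sample mask, each column gets its own random pattern of zeros.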