Inverted Dropout

Ok, I think I understand what is going on now:

The method in the paper is the original formulation of dropout; that was the first paper on the technique. What Prof Ng shows is a different and arguably cleaner way to achieve the same result, usually called "inverted dropout."

In the original formulation, they needed to downscale the weights by multiplying by keep_prob at test time (and any other time the trained weights are used) because they did not rescale the activations during training. Inverted dropout instead multiplies the surviving activations by 1/keep_prob during training. That upscaling means the weights learned during training already end up at the scale that is correct for test time, so no later adjustment is needed. Once training is done, the weights are just the weights: you don't need to remember what keep_prob was, or even that dropout was used at all.

It's simply a more convenient way of getting the same result. I bet if Prof Hinton's group had thought of that formulation at the time, they would have written the paper that way.
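To make the contrast concrete, here is a minimal numpy sketch of the two scalings. The keep_prob value, array shapes, and variable names are just illustrative assumptions, not anything from the paper or the course code; the point is only that each scheme's training-time signal matches (in expectation) what its own test-time network sees.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8

# Hypothetical hidden-layer activations (large batch so averages are stable).
a = rng.random((10000, 5))
mask = rng.random(a.shape) < keep_prob   # 1 = keep the unit, 0 = drop it

# Inverted dropout (the formulation Prof Ng shows):
#   training: drop units, then scale by 1/keep_prob
#   test:     use the activations / weights unchanged
a_train_inverted = (a * mask) / keep_prob

# Original dropout (the formulation in the paper):
#   training: drop units, no rescaling
#   test:     multiply the weights (equivalently, the activations) by keep_prob
a_train_original = a * mask
a_test_original = a * keep_prob

# In expectation, each training signal matches its own test-time signal,
# which is why inverted dropout needs no adjustment after training.
print(a_train_inverted.mean(), a.mean())                 # ~equal
print(a_train_original.mean(), a_test_original.mean())   # ~equal
```

So with inverted dropout the test-time forward pass is just the plain forward pass, while the original method has to carry keep_prob around forever.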