Ok, I think I understand what is going on now:

The method in the paper is the original formulation — it was the first paper about dropout. What Prof Ng shows is a different, and arguably cleaner, way to achieve the same result.

In the original, they needed to downscale the weights by multiplying by *keep_prob* at test time (and every other time they use them) because they did *not* upscale the activations by *1/keep_prob* at training time. In the formulation Prof Ng shows (often called "inverted dropout"), multiplying the activations by *1/keep_prob* during training upscales the outputs, which in turn causes the learned weights (coefficients) to come out downscaled *at training time*. So you *don't* need to downscale them later when you use them. Once training is done, the weights are just the weights: you don't need to remember what the *keep_prob* value was, or even that the training involved dropout at all. I bet if Prof Hinton had thought of that formulation at the time, they would have written the paper that way.
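To make the equivalence concrete, here is a small NumPy sketch (the array names and *keep_prob* value are just for illustration). It applies the same dropout mask both ways and checks that, in expectation, the inverted-dropout activations match the clean ones — which is exactly why no test-time correction is needed:

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
a = rng.random((1000, 100))  # hypothetical layer activations

# Shared dropout mask: each unit is kept with probability keep_prob.
mask = rng.random(a.shape) < keep_prob

# Original formulation (the paper): just zero out dropped units during
# training; then at test time you must multiply the weights by keep_prob.
a_train_original = a * mask

# Inverted dropout (what Prof Ng shows): also divide by keep_prob during
# training, so the surviving activations are upscaled and the weights
# come out already "downscaled" -- no test-time adjustment needed.
a_train_inverted = a * mask / keep_prob

# In expectation the inverted-dropout activations match the clean ones,
# while the original formulation's are smaller by a factor of keep_prob.
print(a.mean(), a_train_original.mean(), a_train_inverted.mean())
```

With enough units, the first and third means agree closely (around 0.5 here), while the second sits near 0.5 × *keep_prob* — the gap that the paper's test-time weight scaling compensates for.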