Inverted Dropout

I understand that dividing by keep_prob scales the activations back up by roughly the same amount that they were reduced by the removal of some nodes.
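
Here is a minimal sketch of what that looks like in numpy (the function name, shapes, and variable names are just for illustration):

```python
import numpy as np

def inverted_dropout(A, keep_prob):
    """Illustrative inverted dropout for one layer's activations A."""
    # Keep each node with probability keep_prob
    D = np.random.rand(*A.shape) < keep_prob
    A = A * D           # zero out the dropped nodes
    A = A / keep_prob   # scale up so the expected value stays roughly the same
    return A
```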

But why do we need this value to stay roughly constant in the first place?

The reason is that all forms of regularization, including dropout, are applied only during training. When we actually use the trained network to make predictions, there is no dropout (which can be achieved without changing the code by simply setting keep_prob = 1). So if you don’t scale up to compensate for the dropout, the later layers will be trained in a way that won’t agree with what they actually receive when we run normal prediction: they will be trained to expect lower aggregate “energy” from the previous layers, so things may not work as well as expected in normal “prediction” mode without dropout.
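
A rough illustrative check (values and shapes are hypothetical) that the average activation the next layer sees during training with inverted dropout roughly matches what it sees at prediction time with no dropout:

```python
import numpy as np

np.random.seed(0)
A = np.random.rand(100, 1000)             # hypothetical layer activations
keep_prob = 0.8

D = np.random.rand(*A.shape) < keep_prob  # training-time dropout mask
A_train = (A * D) / keep_prob             # dropped nodes plus rescaling
A_pred = A                                # prediction: no dropout, no scaling

print(A_train.mean(), A_pred.mean())      # the two means come out close
```

Without the division by keep_prob, A_train.mean() would be about keep_prob times A_pred.mean(), which is exactly the mismatch between training and prediction described above.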

It’s been a while since I have watched Prof Ng’s lectures on this topic, so I don’t remember exactly what he says about this, but he must discuss this point. It might be worth watching again to see. (Update: yes, he mentions this between 8:30 and 9:00 in the main lecture on Dropout.)