I would state the intuition about how dropout works a bit differently than you do. The point is that the neurons that get dropped are different on each iteration, so the effect is to reduce overfitting by weakening the network's reliance on any specific connection between the outputs at one level and the inputs at the next level. Exactly how strong that weakening effect is depends on the keep_prob value that you use, of course. Maybe that subtlety in the intuition doesn't really affect the bigger point you are making here, but I thought it was worth stating.
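To make the "different neurons on each iteration" part concrete, here is a minimal numpy sketch of what happens to one layer's activations on a single training iteration (A1 and D1 are just my own placeholder names, not necessarily the ones from the assignment):

import numpy as np

keep_prob = 0.8                              # probability that any given output survives

A1 = np.random.randn(4, 5)                   # activations of some hidden layer: (units, examples)
D1 = np.random.rand(*A1.shape) < keep_prob   # fresh random mask on every iteration
A1 = A1 * D1                                 # roughly (1 - keep_prob) of the outputs are zeroed

Since D1 is regenerated every iteration, no downstream neuron gets to depend on any one upstream output always being present.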
The problem is not that they can't learn; the question is what they learn. If you don't do the reverse scaling, then they potentially learn different things: they learn to react to weaker outputs, because that's what they are trained on. But then the point is that what they have learned may not fit as well with the data they see when you run actual predictions without the dropout logic in place, because at that point the outputs have more "energy". Did you read far enough in that thread I linked to see the part about the L2 norms of the outputs? Maybe that was earlier in the thread than the link I gave you.
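Here is a rough illustration of that mismatch with my own toy numbers (not the exact ones from the thread): if you train without the rescaling, the next layer is trained on systematically weaker inputs than it will see at prediction time.

import numpy as np

np.random.seed(1)
keep_prob = 0.8
A1 = np.random.randn(100, 1000)              # pretend hidden-layer outputs

D1 = np.random.rand(*A1.shape) < keep_prob
train_no_scaling = A1 * D1                   # what the next layer is trained on
predict = A1                                 # what it sees at prediction time (no dropout)

print(np.linalg.norm(train_no_scaling))      # noticeably smaller ...
print(np.linalg.norm(predict))               # ... than this: more "energy" at prediction time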
Maybe you are thinking too hard here. It actually seems like a pretty straightforward argument: you want the training conditions to be closer to what happens in prediction mode. You want only the stochastic weakening of the reaction to particular neuron outputs, without a general decrease in the L2 norm of the inputs to the next layer.
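In other words, the division by keep_prob is just there to undo that general shrinkage: in expectation, the rescaled outputs the next layer trains on have the same magnitude as the un-dropped outputs it will see in prediction mode. A quick sanity check, again just my own sketch:

import numpy as np

np.random.seed(2)
keep_prob = 0.8
a = 1.7                                      # one fixed activation value
masks = np.random.rand(1_000_000) < keep_prob

# Average of what the next layer sees for this output over many iterations:
print(np.mean(a * masks))                    # about keep_prob * a  (shrunk, no rescaling)
print(np.mean(a * masks / keep_prob))        # about a              (matches prediction mode)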