Week 1: dropout vs reducing network?

I think your interpretation of why dropout works is incorrect. The point is not that we are reducing the activation outputs: it is the subtle weakening of the dependence of a given neuron on any specific inputs from the previous layer. As described in the lectures, you are sampling a different, slightly reduced network on every iteration and for every training sample in the batch. This stochastic weakening of the connections is what reduces the overfitting.

But note that when we actually apply the trained network to make a prediction, dropout is no longer used: we simply use the trained network. That is true of all forms of regularization: they are only applied during training, not during inference. So if we don’t compensate for the reduced “expected value” of the activations during training (which is what dividing by keep_prob does in inverted dropout), the network will not work as well in inference mode: it has been trained to expect less total activation, but at inference it gets values from all the neurons.
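To make the “expected value” point concrete, here is a minimal NumPy sketch of inverted dropout. The function name `dropout_forward`, the all-ones activations and `keep_prob = 0.8` are just illustrative assumptions, not the assignment code. It compares the mean activation with the 1/keep_prob scaling, without it, and in inference mode.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Inverted dropout on activations `a` (hypothetical helper for illustration)."""
    if not training:
        # Inference: no mask and no scaling, just the trained network.
        return a
    # A fresh random mask for every unit and every sample on each call.
    mask = rng.random(a.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation the same as without dropout.
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 200))   # toy activations: 1000 units x 200 samples
keep_prob = 0.8

train_out = dropout_forward(a, keep_prob, rng, training=True)
unscaled = a * (rng.random(a.shape) < keep_prob)   # same masking, no compensation
infer_out = dropout_forward(a, keep_prob, rng, training=False)

# Means: scaled training output vs. unscaled training output vs. inference output.
print(train_out.mean(), unscaled.mean(), infer_out.mean())
```

With the scaling, the training-time mean stays close to the inference-time mean of 1.0. Without it, the mean drops to roughly keep_prob, so the next layer would be trained on systematically smaller inputs than it sees at prediction time.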

This question comes up frequently. Here’s another thread worth a look for this particular question. And here’s another one.