Batch normalization for regularization

Andrew said batch normalization has a regularization effect.
It is true that the computed mean and SD will be noisy, because they are calculated over a minibatch rather than over the entire training set.
This mean is subtracted from the current Z, and the result is divided by the SD. Help me understand why this has the same effect as dropout. In dropout, a neuron might get dropped entirely, so its value becomes 0. So it makes sense that the weights get spread out, since the model can't rely on any single neuron. I cannot see any such effect in normalization.

Hi @vaibhavoutat ,

Let's review BN:

  1. Batches are randomly created.
  2. On each batch, BN divides each unit by a random value (the SD of the randomly generated batch).
  3. BN also subtracts a random value from each unit (the mean of the randomly generated batch).

In the end, BN injects noise, as Dropout does, which forces each layer to learn to handle variations in its inputs.
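To make the noise concrete, here is a minimal NumPy sketch, not anything from the course code. It assumes a single unit with made-up Gaussian pre-activations, a batch size of 64, and no learned gamma/beta; the names `z_all` and `batch_norm` are purely illustrative. It places the same example in many random minibatches and shows that its normalized value changes from batch to batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations Z of one unit over a 10,000-example training set.
z_all = rng.normal(loc=3.0, scale=2.0, size=10_000)

def batch_norm(z_batch, eps=1e-5):
    """Normalize a minibatch with its own mean and variance (no gamma/beta)."""
    mu = z_batch.mean()
    var = z_batch.var()
    return (z_batch - mu) / np.sqrt(var + eps)

# Put example 0 into many different random minibatches and record
# what BN turns it into each time.
batch_size = 64
normalized_values = []
for _ in range(1000):
    others = rng.choice(np.arange(1, z_all.size), size=batch_size - 1, replace=False)
    batch = np.concatenate(([z_all[0]], z_all[others]))
    normalized_values.append(batch_norm(batch)[0])

normalized_values = np.array(normalized_values)

# Reference value: the same example normalized with full-dataset statistics.
full_data_value = (z_all[0] - z_all.mean()) / z_all.std()

print("value using full-dataset statistics:", full_data_value)
print("mean over random minibatches:       ", normalized_values.mean())
print("std over random minibatches (noise):", normalized_values.std())
```

The nonzero spread in the last line is the noise being discussed: the input is identical every time, and only the randomly chosen batch mates differ.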

What do you think about this?


Hi, @vaibhavoutat !

Just as a quick note, check the paper How Does Batch Normalization Help Optimization? It argues against the initial explanation that BN works by reducing internal covariate shift. Instead, the authors show that it

" […] makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training."

Let me know if I got this right. In dropout we drop a node with some probability; this is the noise in dropout, and it forces the network to spread out the weights.
In BN we subtract a noisy mean and divide by a noisy SD. This does not drop the node entirely, but it certainly introduces some noise, and that helps spread out the weights, since the network can't rely totally on a node whose value is noisy.
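A rough way to compare the two kinds of noise is to look at the factor each one multiplies a value by. The sketch below is only an illustration under assumed settings (inverted dropout with `keep_prob = 0.8`, batch size 64, and a unit whose true SD is 1); none of these numbers come from the thread.

```python
import numpy as np

rng = np.random.default_rng(1)
keep_prob = 0.8      # assumed dropout keep probability
batch_size = 64      # assumed minibatch size

# Inverted dropout: each activation is multiplied by 0 or by 1/keep_prob.
dropout_factors = rng.binomial(1, keep_prob, size=10_000) / keep_prob

# BN: each minibatch divides by its own SD. The true SD is 1 here,
# so the noise-free scale factor would be exactly 1.
batches = rng.normal(0.0, 1.0, size=(10_000, batch_size))
bn_factors = 1.0 / batches.std(axis=1)

print("dropout factor values:", np.unique(dropout_factors))
print("dropout factor std:   ", dropout_factors.std())
print("BN factor range:      ", bn_factors.min(), bn_factors.max())
print("BN factor std:        ", bn_factors.std())
```

In this setup the dropout factor jumps between zero and a fixed value, while the BN factor only wobbles around 1 and never reaches zero, which matches the intuition that BN adds noise without ever removing a node outright.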