Why does batch norm work?

Hi Sir,

@paulinpaloalto @bahadir @eruzanski @Carina @neurogeek @lucapug @javier @kampamocha

We had a couple of doubts about the lecture "Why does batch norm work?". Could you please help clarify them?

  1. "A second reason why batch norm works is that it makes the weights later, or deeper, in your network (say, the weights in layer 10) more robust to changes to weights in earlier layers of the neural network (say, in layer 1)." What does this statement mean?

  2. "If you will, it weakens the coupling between what the early layers' parameters have to do and what the later layers' parameters have to do. And so it allows each layer of the network to learn by itself, a little bit more independently of other layers." What does this statement mean?

I was not able to understand these two points from the professor's lecture. Could you please help clarify them? I am stuck.


Hi @Anbu,

I’ll try my best to answer your questions, but first, please keep in mind that quotes extracted directly from the professor’s lectures tend to miss a lot of context. Hopefully my answer will help with both of them by the end.

So, think about the connections between layers in a DNN. Each layer has an input, and (except for the first layer) this input is the output of the previous layer. For the first layer, the input is x0.

Now, the output of each layer depends on several values: the activations of the previous layer, and the weights and biases of the current layer. Consider the level of ‘disarray’ that all these values could have:

  • Weights and biases are randomly initialized.
  • There might be large differences between the samples within a mini-batch (intra-batch) and between mini-batches (inter-batch).
  • Backpropagation can also cause large changes in the weight and bias values, which in turn can cause large differences between the inputs to the next layer, and so on.

All of this can make the training of a DNN very inefficient, and if you look at the chain of connections between layers (input -> output -> input -> output, etc.), you can see that inefficiencies in an early layer can lead to larger inefficiencies in subsequent ones.
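To see your point 1 in a toy setting, here is a small numpy sketch (my own illustration, with made-up layer sizes, not code from the lecture). Even if training changes an early layer's weights drastically, standardizing that layer's outputs over the mini-batch keeps the distribution that the next layer sees roughly stable:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(z, eps=1e-5):
    # Standardize each feature over the mini-batch: the core of batch norm,
    # before the learned scale-and-shift step.
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

x = rng.normal(size=(64, 10))           # a mini-batch of 64 samples
W1_before = rng.normal(size=(10, 10))   # hypothetical layer-1 weights
W1_after = W1_before * 3.0              # pretend training changed layer 1 a lot

for W1 in (W1_before, W1_after):
    z = x @ W1                          # the input seen by the next layer
    z_bn = normalize(z)
    print(f"raw std: {z.std():.2f}  normalized std: {z_bn.std():.2f}")
```

The raw standard deviation roughly triples when the weights triple, but the normalized one stays near 1 in both cases, so the deeper layers see a much more stable input distribution.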

Batch norm helps to reduce these inefficiencies by reducing that level of ‘disarray’, which the authors of the batch norm paper call ‘internal covariate shift’ (Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift). Some studies suggest that applying batch norm to the inputs of a layer has many benefits and speeds up the training of a DNN (also reducing the chances of exploding or vanishing gradients).
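For reference, the transform described in that paper can be sketched in a few lines of numpy (this is only the forward pass for one mini-batch; the sizes and values are made up):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    # Per-feature standardization over the mini-batch, followed by a
    # learned scale (gamma) and shift (beta), as in the batch norm paper.
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta

rng = np.random.default_rng(1)
z = rng.normal(loc=5.0, scale=4.0, size=(32, 8))  # off-center pre-activations
gamma = np.ones(8)                                # learnable, initialized to 1
beta = np.zeros(8)                                # learnable, initialized to 0
out = batch_norm_forward(z, gamma, beta)
print(out.mean(), out.std())                      # ~0 and ~1 with these gamma/beta
```

Note that gamma and beta are learned during training, so the network can still place each layer's inputs at whatever mean and variance works best; the normalization only removes the uncontrolled shift caused by the earlier layers.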

I suggest giving the paper a read, and maybe also checking some other sources on the subject.

Hope that helps!


Dear Mr Jesus Rivero,

Could you please guide me to understand this statement taken from the lecture?

“similar to dropout, batch norm therefore has a slight regularization effect. Because by adding noise to the hidden units, it’s forcing the downstream hidden units not to rely too much on any one hidden unit.”

Specific time in the lecture : 9:39 / 11:39

Why does Batch Norm work? | Coursera

I understand that dropout adds noise to hidden units by turning units off completely with some probability, so that downstream hidden units don’t have a chance to rely on them.

May I have an example of a “downstream hidden unit not relying too much on any one hidden unit” for batch norm?
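To make my question concrete, here is a toy numpy sketch (my own illustration, not from the lecture) of where I think the batch-dependent noise could come from: the same training example is normalized with different mini-batch means and variances depending on which batch it lands in, so its normalized activations differ slightly from batch to batch.

```python
import numpy as np

rng = np.random.default_rng(2)

def normalize(z, eps=1e-5):
    # Standardize each feature using this mini-batch's own mean and variance.
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

sample = rng.normal(size=(1, 4))                  # one fixed training example
batch_a = np.vstack([sample, rng.normal(size=(31, 4))])
batch_b = np.vstack([sample, rng.normal(size=(31, 4))])

out_a = normalize(batch_a)[0]                     # the sample as seen in batch A
out_b = normalize(batch_b)[0]                     # the same sample in batch B
print(out_a - out_b)                              # nonzero: batch-dependent noise
```

Is this batch-to-batch jitter the “noise” the professor means, which keeps a downstream unit from relying too precisely on any one upstream activation?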

Thank you