We had a couple of doubts about the lecture "Why does batch norm work?". Could you please help clarify them?
"A second reason why batch norm works is that it makes weights later or deeper in your network, say the weights in layer 10, more robust to changes to weights in earlier layers of the neural network, say, in layer 1." What does this statement mean?
"If you will, it weakens the coupling between what the early layers' parameters have to do and what the later layers' parameters have to do. And so it allows each layer of the network to learn by itself, a little bit more independently of other layers." What does this statement mean?
These are the two points from the professor's lecture that I am not able to understand. I'm stuck, so a clear explanation would be much appreciated.
I'll try my best to answer your questions, but first, please keep in mind that quotes extracted directly from the professor's lectures tend to lose a lot of context. Hopefully my answer will help with both points by the end.
So, think about the connections between layers in a DNN. Each layer has an input, and (except for the first layer) this input is the output of the previous layer. For the first layer, the input is x0.
Now, the output of each layer depends on several values: the activations of the previous layer, and the weights and biases of the current layer. Consider the level of "disarray" that all these values can have:
Biases and weights were randomly initialized.
There might be large differences between the samples within a mini-batch (intra-batch) and between mini-batches (inter-batch).
Backpropagation can also cause large changes in the weight and bias values, which in turn can cause large differences in the inputs to the next layer, and so on.
So, all of these can make the training process of a DNN quite inefficient, and if you look at the chain of connections between layers (input -> output -> input -> output, and so on), you can tell that inefficiencies in an early layer can lead to larger inefficiencies in subsequent ones.
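To make that concrete, here is a minimal NumPy sketch (the layer sizes, the `layer1_output` helper and the fake "gradient step" are all made up for illustration). It shows how a change in the first layer's weights shifts the distribution of the inputs that the second layer sees, even though the original data x never changed:

```python
import numpy as np

np.random.seed(0)

# Hypothetical 2-layer setup: x -> layer 1 (ReLU) -> inputs to layer 2
x = np.random.randn(256, 10)            # a mini-batch of 256 samples, 10 features
W1 = np.random.randn(10, 32) * 0.5      # first-layer weights (randomly initialized)
b1 = np.zeros(32)

def layer1_output(W, b):
    """Activations of layer 1 = inputs to layer 2."""
    return np.maximum(0, x @ W + b)     # ReLU

a1_before = layer1_output(W1, b1)

# Simulate a gradient step that changes the early layer's weights
W1_updated = W1 + 0.3 * np.random.randn(*W1.shape)
a1_after = layer1_output(W1_updated, b1)

# The distribution of layer 2's inputs has shifted, even though x did not change
print("layer-2 input mean/std before update:", a1_before.mean(), a1_before.std())
print("layer-2 input mean/std after  update:", a1_after.mean(), a1_after.std())
```

That shifting input distribution is exactly what the later layer would have to keep re-adapting to during training.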
Batch norm helps to reduce these inefficiencies by reducing that level of "disarray", which the authors of the batch norm paper call "internal covariate shift" (Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift). There are studies suggesting that applying batch norm to the inputs of a layer can bring many benefits and speed up training of a DNN (it also reduces the chances of exploding or vanishing gradients).
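To give you a rough idea of what batch norm computes at training time, here is a small NumPy sketch of the forward pass described in the paper (the function name `batch_norm_forward` and the sizes are just my own example, not any framework's API): each feature is normalized with its mini-batch mean and variance, and then re-scaled and shifted with the learned parameters gamma and beta.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Batch-normalize a mini-batch of pre-activations z with shape (batch, features)."""
    mu = z.mean(axis=0)                    # per-feature mean over the mini-batch
    var = z.var(axis=0)                    # per-feature variance over the mini-batch
    z_hat = (z - mu) / np.sqrt(var + eps)  # normalize: roughly zero mean, unit variance
    return gamma * z_hat + beta            # scale and shift with learned parameters

# Hypothetical pre-activations for one layer: widely spread values
z = np.random.randn(64, 32) * 5 + 3
gamma, beta = np.ones(32), np.zeros(32)

z_bn = batch_norm_forward(z, gamma, beta)
print(z.mean(), z.std())       # roughly 3 and 5
print(z_bn.mean(), z_bn.std()) # roughly 0 and 1
```

Whatever the earlier layers do to the scale and location of these values, the next layer always sees inputs with a stable mean and variance, which is why it becomes more robust to changes in the earlier layers.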
I suggest giving the paper a read, and maybe also check some other sources on the subject.
Could you please guide me to understand this statement taken from the lecture?
"Similar to dropout, batch norm therefore has a slight regularization effect. Because by adding noise to the hidden units, it's forcing the downstream hidden units not to rely too much on any one hidden unit."
I understand that dropout adds noise to hidden units by turning units off completely with some probability, so that downstream hidden units don't get a chance to rely on any one of them.
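To show what I mean on the dropout side, here is a tiny inverted-dropout sketch I wrote (the `dropout_forward` name, the keep probability and the shapes are just my own toy example, not from the course):

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8):
    """Inverted dropout: randomly zero hidden units and rescale the survivors."""
    mask = (np.random.rand(*a.shape) < keep_prob)  # each unit is kept with probability keep_prob
    return a * mask / keep_prob                    # the zeroed units are the 'noise' downstream layers see

a_hidden = np.random.randn(4, 5)  # hypothetical activations of one hidden layer
print(dropout_forward(a_hidden))  # different entries are zeroed on every forward pass
```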
May I have an example of "downstream hidden units not relying too much on any one hidden unit" for batch norm?