Hi everyone,
Just trying to simplify things a bit for the newer learners here, building on what Balaji and Alva Roramajo explained earlier in this thread. For those who aren't familiar with the fancy term "internal covariate shift", let me explain it.
So, where does this term actually come from? Batch normalization is a layer added between the layers of a neural network: it continuously takes the output of the previous layer and normalizes it before passing it on to the next layer, right?
![batchnorm2_pic](https://global.discourse-cdn.com/dlai/original/3X/f/b/fb0384f4ad07cfcd4b6bc48055b2ae3dfd85ebf9.png)
A phenomenon called internal covariate shift happens whenever the distribution of the inputs to the layers of the network changes. When this input distribution changes, the hidden layers have to keep adapting to the new distribution. This slows down training and eats up a lot of the time we are always worried about when training a model ![:slight_smile:](https://emoji.discourse-cdn.com/google/slight_smile.png?v=12)
To keep the distribution of the data similar across layers, we use the batch normalization technique to normalize the outputs to mean = 0 and std = 1. With this technique the model trains faster, and the accuracy also tends to be higher than in the same model without batchnorm.
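To make that normalization step concrete, here is a minimal NumPy sketch of a batchnorm forward pass for one mini-batch (the names `batchnorm_forward`, `gamma`, `beta`, and `eps` are just illustrative, not from any particular assignment):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch to mean 0 / std 1, then scale and shift.

    x:     (batch_size, features) outputs from the previous layer
    gamma: (features,) learnable scale
    beta:  (features,) learnable shift
    """
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations (mean 0, std 1)
    out = gamma * x_hat + beta             # scale and shift with learnable parameters
    cache = (x_hat, gamma, var, eps)       # saved for the backward pass
    return out, cache
```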
In practice, we add batchnorm in the hidden layers, either just before the activation function (as in the original paper) or just after it; both placements are common.
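For example, the two placements could look like this in PyTorch (just an illustrative sketch; the layer sizes are made up):

```python
import torch.nn as nn

# BatchNorm before the activation, as in the original paper
model_pre = nn.Sequential(
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),   # normalize the linear output...
    nn.ReLU(),            # ...then apply the non-linearity
    nn.Linear(32, 10),
)

# BatchNorm after the activation, which many practitioners also use
model_post = nn.Sequential(
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.BatchNorm1d(32),   # normalize the activation output instead
    nn.Linear(32, 10),
)
```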
Now, what N varma asked is: during this whole process, does batchnorm affect the derivatives too? Yes, the backpropagation step of batch normalization computes the derivatives of gamma (dgamma) and beta (dbeta). Gamma is used to scale the normalized values and beta is used to shift them up or down, which also eliminates the need for a separate bias term.
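To see where dgamma and dbeta come from, here is a small NumPy sketch of the corresponding backward pass, matching the forward sketch above (again, the names are illustrative, and this is just one common way to write the gradient):

```python
import numpy as np

def batchnorm_backward(dout, cache):
    """Backward pass for the batchnorm sketch above.

    dout:  (batch_size, features) upstream gradient from the next layer
    cache: values saved by batchnorm_forward
    """
    x_hat, gamma, var, eps = cache
    N = dout.shape[0]

    dbeta = dout.sum(axis=0)              # gradient w.r.t. the shift parameter beta
    dgamma = (dout * x_hat).sum(axis=0)   # gradient w.r.t. the scale parameter gamma

    # Gradient w.r.t. the layer input, backpropagated through the normalization
    dx_hat = dout * gamma
    dx = (1.0 / (N * np.sqrt(var + eps))) * (
        N * dx_hat
        - dx_hat.sum(axis=0)
        - x_hat * (dx_hat * x_hat).sum(axis=0)
    )
    return dx, dgamma, dbeta
```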
The original paper doesn’t say much about how they are learned. But yes, the paper that Alva Roramajo mentioned does say that batch normalization makes the optimization landscape significantly smoother, and that this has little to do with reducing internal covariate shift.