In Week 3, Video 4: Normalizing Activations in a Network, it mentions the following near the beginning:
Batch normalization makes your hyperparameter search problem much easier, makes your neural network much more robust.
However, I don’t completely get how batch normalization helps with hyperparameter search. Is it just because the training process is faster, or is it something else? Could someone please elaborate on this? I would really appreciate it.
Let’s think about the input data first. If the input features are on very different scales, like x_1 in [0, 1] and x_2 in [-100000, 100000], we usually “normalize” them so that they have similar distributions. We also shuffle the input data to lower the chance of bias. Otherwise, the next step becomes quite unstable, being dominated by the x_2 values or by biased data.
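As a minimal sketch of this input normalization (plain NumPy with made-up numbers, not the course’s assignment code):

```python
import numpy as np

# Two input features on very different scales, as in the example above
X = np.array([[0.2, -80000.0],
              [0.9,  50000.0],
              [0.5, -20000.0],
              [0.1,  90000.0]])          # shape (m, 2)

# Standardize each feature to mean 0 and variance 1
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / (sigma + 1e-8)       # small epsilon avoids division by zero

print(X_norm.mean(axis=0))               # ~[0, 0]
print(X_norm.std(axis=0))                # ~[1, 1]
```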
The same is true for the hidden units in a neural network. Consider a network with 2 hidden layers. The input to the 2nd hidden layer is, of course, the output of the 1st hidden layer. If this output is quite “unbalanced”, then the operations in the 2nd hidden layer’s units become unstable, which makes it difficult to find the best hyperparameters easily and quickly. To avoid that situation, we “normalize” the output values of the previous hidden units to have mean 0 and variance 1. Then, with a normalized input, the 2nd hidden layer becomes stable. (From an algorithmic viewpoint, we do not use this normalized value as is. Instead, the authors of the original paper introduced \gamma and \beta to scale and shift the normalized data.)
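Here is a rough sketch of what one batch norm step does to a layer’s pre-activations (plain NumPy, assuming z has shape units x batch and gamma/beta are the learnable parameters, not the course’s assignment code):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Normalize a mini-batch of pre-activations z (shape: units x m),
    then scale and shift with the learnable parameters gamma and beta."""
    mu = z.mean(axis=1, keepdims=True)       # per-unit mean over the mini-batch
    var = z.var(axis=1, keepdims=True)       # per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)   # mean 0, variance 1
    z_tilde = gamma * z_norm + beta          # learned scale and shift
    return z_tilde

# Toy usage: 3 hidden units, mini-batch of 5 examples
z = np.random.randn(3, 5) * 50 + 10          # deliberately "unbalanced"
gamma = np.ones((3, 1))
beta = np.zeros((3, 1))
print(batch_norm_forward(z, gamma, beta).std(axis=1))  # ~1 for each unit
```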
If you watch the other video, “Why does Batch Norm work?”, you may find additional insights about “covariate shift”, which refers to a change in the input distribution.
Personally, I like to put BatchNorm into my neural networks.
Hey @Oscar_Guarnizo,
Welcome to the community. Allow me to add on to what Nobu has answered. Consider any simple 2-layer neural network (NN), and try to answer the question: what are the different hyper-parameters in a NN that we could search over? A lot comes to mind, some of them being:
- Number of layers
- Number of neurons in each layer
- Activation functions
- Different types of initialization of weights
Now, let’s say that we don’t normalize our data, and we find the training set error to be 20%. In this case, we don’t know whether we should attribute this high error to the inherent differences in the inputs (for instance, the different magnitude scaling described by Nobu), or to the fact that our NN has only a few layers and hence is a very simple model that needs more layers, or to one of a hundred other causes.
Therefore, normalizing your inputs is a great way to ensure that the model’s performance doesn’t depend on the inherent differences in the inputs, and by doing the same thing for the inputs of the hidden layers, Batch Normalization (BN) makes the problem of hyper-parameter search easier.
And if you are wondering why simple normalization alone doesn’t suffice, then as Nobu mentioned, BN has scaling and shifting parameters for each of the layers, which are a great way to ensure that the data’s distribution is preserved. Other advantages of BN can be seen in the lecture video mentioned by Nobu. I hope this helps.
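If you want to see those per-layer scale and shift parameters in practice, here is a hedged illustration using tf.keras (my own example, not the course’s assignment code; the layer sizes are arbitrary): a BatchNormalization layer after each Dense layer creates and learns gamma and beta automatically.

```python
import tensorflow as tf

# Hypothetical 2-hidden-layer model with batch norm after each hidden layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, use_bias=False, input_shape=(10,)),
    tf.keras.layers.BatchNormalization(),   # learns gamma (scale) and beta (shift)
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(64, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Note that `use_bias=False` is used here because BN’s beta parameter already plays the role of a bias for each unit.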
Regards,
Elemento
Thank you so much @anon57530071. This explanation was really great.
Thank you so much @Elemento. I think you perfectly complement the explanation of @anon57530071. It is really helpful, and now I understand.