Without batch norm, the deeper layers learn W and b without outside intervention (we don’t tell the network what the weights should be).
I understand the purpose of batch norm is to give the deeper layers of the network more normalized inputs (the outputs of the shallower layers). But my intuition tells me that we’re starting to interfere with the network internals: instead of learning W and b independently, the network now has to learn W′ and b′ as a result of us meddling with the way it learns. The result is no different, just that W′ ≠ W and b′ ≠ b, so what’s the point?
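One detail worth noting about the "W′ vs. W" worry: batch norm adds its own learnable scale (γ) and shift (β) after normalizing, so the network is not losing expressive power. Here is a minimal NumPy sketch of the batch norm forward step (my own illustrative code, not the course's implementation); the comments mark what it demonstrates:

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch to zero mean, unit variance.
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_norm = (z - mu) / np.sqrt(var + eps)
    # Learnable scale (gamma) and shift (beta): the network can set these
    # during training to restore any mean/variance it needs -- including
    # the original, un-normalized one -- so no representations are lost.
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
# Pre-activations with an arbitrary mean and scale (illustrative values).
z = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

# Default gamma=1, beta=0: output is normalized per feature.
out = batchnorm_forward(z, gamma=np.ones(4), beta=np.zeros(4))

# With gamma = sqrt(var + eps) and beta = mu, the layer exactly
# undoes the normalization and recovers z.
recovered = batchnorm_forward(z, np.sqrt(z.var(axis=0) + 1e-5), z.mean(axis=0))
```

So the normalization itself is invertible by the γ and β parameters; what changes is the geometry of the optimization problem, not the set of functions the network can learn.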
Hi, djdevilliers. I just saw your question, and it’s the same question I had.
I enrolled in this course and have kept learning. I had the same question after the “Batch Norm” lecture, but now I am a bit more sure about why it works.
First, we learned about input normalization. The goal of input normalization is to scale all the input features to the same, or a similar, range, right? I think you understood what input normalization is and why it works well; Andrew included a very intuitive figure to explain it. Once you understand why input normalization works, it is easy to understand BN (Batch Norm): from the perspective of the l-th hidden layer, the output of the previous layer (l-1) can be thought of as its input.
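To make the input-normalization analogy concrete, here is a small sketch (the feature values are made up for illustration): two features on very different scales get mapped to the same zero-mean, unit-variance range, which is exactly what BN then does to a hidden layer's outputs.

```python
import numpy as np

# Two input features on very different scales, e.g. house size in
# square feet vs. number of bedrooms (illustrative numbers only).
X = np.array([[2100.0, 3.0],
              [1600.0, 2.0],
              [2400.0, 4.0],
              [1400.0, 2.0]])

# Standard input normalization: per-feature zero mean, unit variance.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma
```

After this step both columns of `X_norm` live on the same scale, so gradient descent no longer has to cope with one feature dominating the cost surface; BN applies the same idea to the "inputs" that layer l receives from layer l-1.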
Second, in the lecture Andrew briefly compares BN and dropout. Dropout is a useful method for spreading out the weights and preventing reliance on any one feature. I think BN shares a bit of this property with dropout. Once you compute the output of layer l-1, some values may be much larger than the others. In that case, the output of layer l may come to depend heavily on those large features, which can cause overfitting. Once we apply BN to the output of layer l-1, we solve this problem and spread out the weights.
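You can see this "one large feature dominates" effect numerically. In this sketch (my own toy example, not from the lecture), one of three hidden activations is roughly 100× larger than the others, and we measure how much each feature contributes to a unit's weighted sum before and after per-feature normalization:

```python
import numpy as np

rng = np.random.default_rng(1)
# Outputs of layer l-1 for a mini-batch of 32: feature index 2 is ~100x larger.
a_prev = rng.normal(size=(32, 3)) * np.array([1.0, 1.0, 100.0])

# Weights of one unit in layer l (arbitrary illustrative values).
w = np.array([0.3, 0.5, 0.2])

# Average absolute contribution of each feature to the weighted sum.
contrib = np.abs(a_prev * w).mean(axis=0)
share = contrib / contrib.sum()          # feature 2 dominates the sum

# Normalize each feature (the core of BN), then re-measure contributions.
a_norm = (a_prev - a_prev.mean(axis=0)) / a_prev.std(axis=0)
contrib_bn = np.abs(a_norm * w).mean(axis=0)
share_bn = contrib_bn / contrib_bn.sum()  # contributions now balanced
```

Before normalization, the large feature carries almost the entire weighted sum regardless of its weight; after normalization, the contributions are governed by the weights themselves, which matches the "spread out the weights" intuition above.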
The above is just my own thinking and understanding from the lecture.
If there is any mistake, or you disagree with my opinion, please let me know what you think.
Hope to hear your ideas soon.
Hello @djdevilliers, this category is for the Machine Learning Specialization. Are you a learner from the Deep Learning Specialization asking about DLS course materials in course 2 week 3?
Yes, apologies if I posted in the wrong place. The UI made it very difficult for me to post in the correct place.
It’s fine @djdevilliers, please don’t worry about it. I have moved this thread to the DLS category.
The paper for Batch Normalization [1502.03167] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift is a pretty good read.
And some papers analyzing the results of the first paper: https://proceedings.neurips.cc/paper/2018/file/36072923bfc3cf47745d704feb489480-Paper.pdf
TLDR: the main argument for using Batch Normalization is it allows us to use a larger learning rate without losing accuracy.
There are also alternatives to Batch Normalization such as Group Normalization.
The paper for Group Norm: [1803.08494] Group Normalization
And I like Yannic Kilcher’s explanations of Batch and Group Normalization on YouTube.
Hope this helps!
Thank you for the links; useful readings.