Week 3: Why Batch Norm Works


I’m a little confused and was wondering if someone could help clarify things for a newbie :slight_smile:. I’m not sure I fully understand why normalizing the hidden units helps make the later hidden units more robust to changes in the earlier layers.

Thanks in advance!

I think this abstract is good:

The popular belief is that this effectiveness stems from controlling the change of the layers’ input distributions during training to reduce the so-called “internal covariate shift”. In this work, we demonstrate that such distributional stability of layer inputs has little to do with the success of BatchNorm. Instead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training.



That’s really interesting! Thanks for the link, @jonaslalin. Note that this paper was published almost a year after Course 2 of this series was originally released, so it probably reflects a more recent understanding of why BatchNorm is effective than what is said in Prof Ng’s lectures. Or did any of those lectures get updated as part of the recent “refresh”?


I watched the lectures recently and didn’t see any updates regarding batch norm. Prof Andrew Ng mainly talks about reducing internal covariate shift, since the original paper presents it as the number one reason. I think there are even more papers than the one I linked that build upon these ideas. Batch norm still has a bit of magic surrounding it :woman_mage: :smiley:

Prof Andrew Ng says

So from the perspective of the third hidden layer, these hidden unit values are changing all the time, and so it’s suffering from the problem of covariate shift that we talked about on the previous slide. So what batch norm does, is it reduces the amount that the distribution of these hidden unit values shifts around.


But what this does is, it limits the amount to which updating the parameters in the earlier layers can affect the distribution of values that the third layer now sees and therefore has to learn on. And so, batch norm reduces the problem of the input values changing, it really causes these values to become more stable, so that the later layers of the neural network has more firm ground to stand on.
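To make the lecture's point concrete, here is a minimal sketch in plain Python (not the course's code). It normalizes one hidden unit's values across a mini-batch and compares their mean and variance before and after a pretend update to the earlier layer. The inputs `x`, the two `(w, b)` settings, and the helper names `batch_norm` and `mean_and_var` are all made up for illustration. In this toy case the earlier layer is purely linear, so normalization cancels the change completely; in a real network batch norm only pins the first two moments, and the more recent paper argues this is not the main reason it helps.

```python
import math

def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one hidden unit's values across a mini-batch, then scale and shift."""
    m = len(z)
    mean = sum(z) / m
    var = sum((v - mean) ** 2 for v in z) / m
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in z]

def mean_and_var(values):
    """Return the (mean, population variance) of a list of floats."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    return mean, var

# Hypothetical mini-batch of inputs feeding an earlier layer.
x = [0.5, -1.2, 3.0, 0.1, 2.2, -0.7]

# Two hypothetical settings of the earlier layer's parameters (w, b),
# standing in for "before" and "after" a gradient update.
z_before = [1.0 * v + 0.0 for v in x]
z_after = [2.5 * v + 4.0 for v in x]

raw_before = mean_and_var(z_before)
raw_after = mean_and_var(z_after)
bn_before = mean_and_var(batch_norm(z_before))
bn_after = mean_and_var(batch_norm(z_after))

# The raw statistics move with the earlier layer's parameters;
# the normalized ones stay near mean 0 and variance 1.
print("raw (mean, var):       ", raw_before, raw_after)
print("normalized (mean, var):", bn_before, bn_after)
```

With `gamma` and `beta` left at their defaults, the later layer sees roughly zero mean and unit variance in both cases. In a trained network `gamma` and `beta` are learned, so the layer can settle on whatever mean and variance work best rather than exactly 0 and 1.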


Sorry for not answering the question more directly :smiley: I still think you should read the more recent paper as well @ajpark07 :slight_smile:


For better understanding:

And if you want to start questioning the reduction of internal covariate shift:

Welcome down the rabbit hole :ghost:


Sorry for the late reply, and thank you very much for the detailed response! I love rabbit holes!


Very interesting. Thanks for sharing!
