Week 3: Why Batch Norm Works

Hello,

I’m a little confused and was wondering if someone could help clarify things for a newbie :slight_smile:. I’m not sure I fully understand why normalizing the hidden units helps keep the later hidden layers more robust to changes in the earlier layers.

Thanks in advance!

I think this abstract is good:

The popular belief is that this effectiveness stems from controlling the change of the layers’ input distributions during training to reduce the so-called “internal covariate shift”. In this work, we demonstrate that such distributional stability of layer inputs has little to do with the success of BatchNorm. Instead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training.

From “How Does Batch Normalization Help Optimization?” (Santurkar et al., NeurIPS 2018).
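For anyone who wants to see concretely what operation both explanations are talking about, here is a minimal NumPy sketch of the batch norm forward pass (my own toy illustration, not code from the paper or the course; the layer sizes are made up):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Normalize a mini-batch of pre-activations z (shape: batch x units),
    then apply the learned scale (gamma) and shift (beta)."""
    mu = z.mean(axis=0)                    # per-unit mean over the batch
    var = z.var(axis=0)                    # per-unit variance over the batch
    z_hat = (z - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * z_hat + beta            # distribution now set by gamma and beta

# Toy usage: 64 examples, 10 hidden units with an awkward scale and offset
rng = np.random.default_rng(0)
z = 5.0 * rng.normal(size=(64, 10)) + 3.0
out = batch_norm_forward(z, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(2))  # ~0 for every unit
print(out.std(axis=0).round(2))   # ~1 for every unit
```

The paper’s argument is about what this normalization does to the optimization landscape, not about the mechanics above, but it helps to keep the operation itself in mind.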


That’s really interesting! Thanks for the link, @jonaslalin. Note that this paper was published almost a year after Course 2 of this series was originally published, so it probably represents a more recent understanding of why BatchNorm is effective than the explanation given in Prof Ng’s lectures. Or did any of those lectures get updated as part of the recent “refresh”?


I watched the lectures recently and didn’t see any updates regarding batch norm. Prof Andrew Ng mainly talks about reducing internal covariate shift, since that is presented as the main motivation in the original batch norm paper. There are more papers beyond the one I linked that build on these ideas. Batch norm still has a bit of magic surrounding it :woman_mage: :smiley:

Prof Andrew Ng says

So from the perspective of the third hidden layer, these hidden unit values are changing all the time, and so it’s suffering from the problem of covariate shift that we talked about on the previous slide. So what batch norm does, is it reduces the amount that the distribution of these hidden unit values shifts around.

and

But what this does is, it limits the amount to which updating the parameters in the earlier layers can affect the distribution of values that the third layer now sees and therefore has to learn on. And so, batch norm reduces the problem of the input values changing, it really causes these values to become more stable, so that the later layers of the neural network has more firm ground to stand on.

i.e., batch norm pins the mean and variance of each layer’s inputs (to whatever values the learned γ and β dictate), so the later layers see a stable distribution even while the earlier layers’ weights keep changing.
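To make that concrete, here is a small NumPy experiment (my own sketch, with made-up layer sizes and random weights, not code from the lecture) showing that the raw hidden-unit values a later layer sees shift around as the earlier weights change, while their batch-normalized versions stay pinned at roughly zero mean and unit variance:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(256, 4))               # one fixed mini-batch of inputs

def batch_norm(z, eps=1e-5):
    # Normalization step only; gamma and beta are omitted so the statistics stay comparable
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

# Pretend the earlier layer's weights keep drifting during training
for w_scale in (0.5, 1.0, 2.0):
    w1 = w_scale * rng.normal(size=(4, 8))  # "earlier layer" weights at three training moments
    z2 = np.maximum(0, x @ w1)              # hidden-unit values the next layer sees
    print(f"w_scale={w_scale}: raw mean={z2.mean():6.2f}, raw std={z2.std():5.2f}, "
          f"bn mean={batch_norm(z2).mean():5.2f}, bn std={batch_norm(z2).std():4.2f}")
```

The raw statistics move around with the earlier weights, while the normalized values keep roughly zero mean and unit variance, which is the “firm ground” the lecture is describing. Whether that stability is the real reason batch norm speeds up training is exactly what the paper above calls into question.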

Sorry for not answering the question more directly :smiley: I still think you should read the more recent paper as well @ajpark07 :slight_smile:


For a better understanding:

And if you want to start questioning the reduction of internal covariate shift:

Welcome down the rabbit hole :ghost:


Sorry for the late reply, and thank you very much for the detailed response! I love rabbit holes!


Very interesting. Thanks for sharing!
