Why do we run BatchNormalization after Conv2D?

In Week 1, Assignment 2, we are asked to add a BatchNormalization layer after Conv2D. This was not introduced in lecture. When batch normalization was introduced several courses back, we learned to normalize the input. Should we be normalizing at the input too, as well as after every convolution? What is common practice in research and applications that I can read about?

Thank you

Hey @moose,
The key thing to note here is that although both Batch Normalization and Input Normalization are normalizing operations, the reasons for using each of them differ to some extent.

Normalization of inputs is done to make sure that the optimization algorithm doesn't have to make steep updates in one direction and small updates in another, i.e., it ensures that the gradient descent updates are more or less uniform in every direction. You may recall the figure from the course depicting a circular contour plot of the cost function.
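As a quick sketch of that idea (using made-up data, not anything from the assignment), input normalization just standardizes each feature using statistics computed from the training set:

```python
import numpy as np

# Hypothetical training set: 100 samples, 3 features on very
# different scales, which would otherwise make gradient descent
# zig-zag along the steep direction of the cost contour.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0.0, 50.0, 1000.0],
                     scale=[1.0, 10.0, 300.0],
                     size=(100, 3))

# Standardize each feature using TRAINING-set statistics only;
# the same mu and sigma are reused on the dev/test sets.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_norm = (X_train - mu) / sigma

# After normalization every feature has roughly zero mean and unit
# variance, giving the more "circular" cost contour from the lecture.
print(X_norm.mean(axis=0).round(6))
print(X_norm.std(axis=0).round(6))
```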

On the other hand, Batch Normalization is performed in order to make the updates independent of the “different distribution statistics” followed by each batch of the inputs. Since different batches of inputs might have different mean and variance, using Batch Normalization, we can ensure that all of the batches follow the same distribution, and we can use the learned statistics and the parameters to govern the distribution followed by the test samples as well (to some extent).
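To see what "making batches follow the same distribution" means concretely, here is a minimal NumPy sketch of the BatchNorm forward pass in training mode (the running statistics kept for test time are omitted for brevity):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Minimal BatchNorm forward pass (training mode) on a mini-batch.

    x: (batch, features); gamma/beta: learned scale and shift.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize per batch
    return gamma * x_hat + beta            # learned scale/shift restores expressiveness

rng = np.random.default_rng(1)
# Two batches deliberately drawn from different distributions...
batch_a = rng.normal(5.0, 2.0, size=(32, 4))
batch_b = rng.normal(-3.0, 0.5, size=(32, 4))

gamma, beta = np.ones(4), np.zeros(4)
out_a = batch_norm_forward(batch_a, gamma, beta)
out_b = batch_norm_forward(batch_b, gamma, beta)

# ...but after BatchNorm both have roughly zero mean and unit variance.
print(out_a.mean().round(3), out_b.mean().round(3))  # both ≈ 0
```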

So, using both Input Normalization and Batch Normalization is unlikely to hurt your model. If anything, chances are that your model's convergence will be faster.

Now, where to use Batch Normalization (for instance, after every Conv layer, or after every other Conv layer) varies from model to model. For that, you can take a look at the famous model architectures and see how they use Batch Normalization throughout.
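For example, the Conv → BatchNorm → ReLU pattern used in architectures like ResNet looks like this in Keras (the input shape and filter counts here are made up for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 64, 3))

# use_bias=False is a common choice before BatchNorm, since BN's
# learned beta shift makes a separate conv bias redundant.
x = layers.Conv2D(32, 3, padding="same", use_bias=False)(inputs)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)

x = layers.Conv2D(64, 3, padding="same", use_bias=False)(x)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)

model = tf.keras.Model(inputs, x)
model.summary()
```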

I hope this helps.


Or maybe we can restate this question as just about the point of view you take: the point is that the output of a Conv layer is also the input to the next layer, right? So we’re not applying BatchNorm to the output of the Conv layer: we’re applying it to the input of the next Conv layer or whatever the next layer is. In other words, it’s just a question of perspective. Of course it’s all equivalent, but we just have to look at the goal in the right way to understand the point.


Interesting insight @paulinpaloalto Sir, thanks for sharing it :nerd_face:

