[Week 2] What is the meaning of (axis = 3) in BatchNormalization?

Why are we normalizing only along the channels and not along the height and width, for example?

I’m having a hard time mapping the concepts from Prof. Andrew Ng’s lectures (where I couldn’t find any reference to the channels or dimensions of the images) onto their application in the TensorFlow code.

Can anybody shed some light on this :)?

Many thanks in advance!


So when you batch normalise along the channels, you make sure that the inputs of the red channel, the green channel and the blue channel are each normalised with respect to the batch.
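To make "normalised with respect to the batch" concrete, here is a minimal NumPy sketch of the statistics involved. It assumes the usual (batch, height, width, channels) layout, so axis 3 is the channel axis; the batch contents, shapes and `eps` value are made up for illustration, and the learnable scale and shift that a real batch norm layer applies afterwards are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical mini-batch: 8 RGB images of 40x40, layout (batch, height, width, channels).
x = rng.uniform(0.0, 255.0, size=(8, 40, 40, 3))

# Keeping axis 3 means: reduce over every other axis (batch, height, width).
mean = x.mean(axis=(0, 1, 2))   # shape (3,): one mean per channel
var = x.var(axis=(0, 1, 2))     # shape (3,): one variance per channel

eps = 1e-3  # small constant for numerical stability (assumed value)
x_hat = (x - mean) / np.sqrt(var + eps)

print(mean.shape, var.shape)
print(x_hat.mean(axis=(0, 1, 2)))  # close to 0 for each channel
print(x_hat.var(axis=(0, 1, 2)))   # close to 1 for each channel
```

So there is one mean and one variance per colour channel, each computed from every pixel of that channel in every image of the batch.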

Let’s consider an image (sorry about my drawing skills): a simple 40x40 monochromatic image.
I think it is visible that it is somewhat of a bird.

To simulate a transformation along the height (it would work similarly along the width), I’ve inverted the colours of every other column.
Pretty hard to tell this was a bird, right? And remember that here every other column had the same transform applied to it. If we batch normalise along the height, every column will have a different transform applied to it, which will remove even more information than this example did and make it harder for our model to learn from the image. There’s no need to take information out of our data.

Now let’s see what happens when we apply the same transformation to the whole image.
As you can see, we still have the entire image data intact with no losses.

Each channel is exactly like a monochromatic image, so when we apply batch normalisation per channel, it preserves our data while normalising it. Remember, batch normalisation is just a transform (much like the inversion I did here), so applying it per channel will not cause loss of data, unlike normalising along the height or width.
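For anyone who prefers numbers to drawings, here is a rough NumPy sketch of the same argument. The image is a made-up stand-in (dark left half, bright right half), not the bird, and for simplicity it uses a single image with per-column statistics; the same reasoning applies to rows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 40x40 monochrome image: dark left half, bright right half, plus a little noise.
img = np.full((40, 40), 0.1)
img[:, 20:] = 0.9
img += rng.normal(0.0, 0.01, size=img.shape)

def normalise(a, axis=None):
    """Subtract the mean and divide by the standard deviation along `axis`."""
    return (a - a.mean(axis=axis, keepdims=True)) / a.std(axis=axis, keepdims=True)

whole = normalise(img)            # one mean/std for the whole image (per-channel style)
per_col = normalise(img, axis=0)  # a separate mean/std for every column

# Whole image: a single affine map, so the dark/bright contrast survives.
print(whole[:, :20].mean(), whole[:, 20:].mean())
# Per column: every column is forced to mean 0, so the contrast is gone.
print(per_col[:, :20].mean(), per_col[:, 20:].mean())
```

With one set of statistics for the whole image, the left half stays clearly darker than the right half. With per-column statistics, both halves end up with the same average, so the one thing the picture showed has been normalised away.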

Hopefully this helped you visualise the entire thing :) . If not, feel free to ask for clarification.


OK, so now I get it! We are supposed to normalize the whole image (height and width) on a per-channel basis, and the way to do that is to normalize over the height and width for each channel. That’s what axis = 3 means.

Many thanks to both of you for taking the time to deliver explanations this detailed!

Hi Federico,

Excellent question.

Take as an example the happyModel in Assignment 2 of week 1 (Convolutional_Model_Application). This model is constituted as follows:


The CONV2D layer serves to extract features from the padded image by means of one filter per feature. The application of a filter outputs a 2D array with activation values that support the extraction of a particular feature. These are the values you want to normalize, i.e. per feature, so that the parameters can be learnt faster.

As you may recall from the videos, the 2D arrays per filter/feature are stacked along the channels. So the number of channels equals the number of filters/features. In order to normalize values per feature, you therefore want to normalize along the channel. This is why batchnorm is applied per channel, i.e. along axis 3.
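As a rough illustration of "one 2D array per filter, stacked along the channels", here is a small NumPy sketch. It is not the happyModel code: the single-channel 10x10 input, the four random 3x3 filters and the helper `conv2d_valid` are all made up for the example, and only one image is used, so there is no batch axis.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.uniform(size=(10, 10))        # hypothetical single-channel input
filters = rng.normal(size=(4, 3, 3))      # 4 hypothetical 3x3 filters

# One 2D feature map per filter, stacked along the last (channel) axis.
feature_maps = np.stack([conv2d_valid(image, k) for k in filters], axis=-1)
print(feature_maps.shape)                 # (8, 8, 4)

# Normalizing per feature = one mean/variance per filter, i.e. per channel.
mean = feature_maps.mean(axis=(0, 1))
var = feature_maps.var(axis=(0, 1))
print(mean.shape, var.shape)
```

The last axis has one entry per filter, so keeping that axis while averaging over the others gives exactly one mean and one variance per feature.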

Hope this helps.


Thanks for the explanation, reinoudbosch. Is there any reference where I can read a little more about this? Is this the standard way to use batch normalization in a CNN? I’m still a bit confused. I recall that in C2 we normalize Z^(i) to Zhat^(i) across the samples in the current batch. From what I’ve learned, since Z^(i) is a 2D matrix, this would mean that we normalize each component of Z^(i) separately. So the restriction to normalizing per channel/feature seems a bit different from what was taught in C2. Thank you.

Hi kelvinn,

How to use batch normalization depends on the specific purpose you want it to fulfill. In the case of using filters, you want to extract features, so this is what the implementation of batch norm should optimize. For other networks, this may differ. For a general introduction to batch normalization with references to interesting papers you can have a look here.
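One way to connect this to the C2 formulation (this is an interpretation, not something taken from the course material): in a fully connected layer each unit is a feature and gets its own mean over the batch, while in a convolutional layer each filter is a feature, and every spatial position of its output can be treated as one more sample of that feature. A small NumPy sketch with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fully connected layer (as in C2): Z has shape (batch, units).
z_dense = rng.normal(size=(16, 5))
mu_dense = z_dense.mean(axis=0)          # one mean per unit -> shape (5,)

# Convolutional layer: Z has shape (batch, height, width, filters).
z_conv = rng.normal(size=(16, 8, 8, 4))
mu_conv = z_conv.mean(axis=(0, 1, 2))    # one mean per filter -> shape (4,)

# Equivalent view: treat every spatial position as an extra "sample" of the same feature.
z_flat = z_conv.reshape(-1, z_conv.shape[-1])   # shape (16*8*8, 4)
mu_flat = z_flat.mean(axis=0)

print(mu_dense.shape, mu_conv.shape, z_flat.shape)
print(np.allclose(mu_conv, mu_flat))
```

Seen this way, the per-channel version is the same "normalize each feature across the samples" rule from C2, just with batch × height × width samples per feature instead of batch samples.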

Thank you very much for the fast response. I’ll try to look at the reference.