Batch Normalization on Channels

Hello everyone,
Why do we have to run the normalization on the channels axis? If the input, say an RGB image, has 3 channels, wouldn't it be a better approach to normalize the values of each channel separately (i.e., normalize across axes 1 and 2, the height and width)? Or am I getting it wrong?

For each image that passes through a Conv2D layer, the output has the shape [new_height, new_width, num_filters_in_conv]. The last dimension corresponds to channels. When you run batch norm across the channel axis, batch norm tracks 4 variables for every channel: $\beta$, $\gamma$, $\mu$, $\sigma$. Please see this link.
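To make the "4 variables per channel" point concrete, here is a minimal NumPy sketch of what batch norm over the channel axis computes (the shapes and the channels-last layout are assumed for illustration; a real Keras `BatchNormalization` layer additionally keeps moving averages of $\mu$ and $\sigma$ for inference):

```python
import numpy as np

# Toy Conv2D output: a batch of 8 feature maps, shape (N, H, W, C).
x = np.random.randn(8, 16, 16, 32)

# Batch norm on the channel axis: statistics are computed across
# batch, height, and width (axes 0, 1, 2), leaving one mean and one
# variance PER CHANNEL.
mu = x.mean(axis=(0, 1, 2))    # shape (32,)
var = x.var(axis=(0, 1, 2))    # shape (32,)

# Learnable per-channel scale (gamma) and shift (beta), initialized
# here to 1 and 0, matching common framework defaults.
gamma = np.ones(32)
beta = np.zeros(32)

eps = 1e-3
y = gamma * (x - mu) / np.sqrt(var + eps) + beta

# So batch norm tracks exactly 4 vectors along the channel axis:
# beta, gamma, mu, sigma -> 4 * C values in total.
print(y.shape)  # (8, 16, 16, 32)
```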

I suppose Balaji explained the key points already, but here are a few additions.

BatchNormalization applies "normalization and scale/shift" to the output of the previous layer in order to stabilize the neural network. On the other hand, the filters used in Conv2D (and other layers) try to extract characteristics of the image/text. Each filter creates one output, i.e. one channel. In this sense, as you know, the number of channels in the output equals the number of filters. There are no RGB channels even after the first Conv2D layer.
Then, the next question may be: are there any advantages to, or meaning in, handling a single RGB signal (say, R) separately? I think that, for extracting characteristics from an image, combinations of R, G, and B carry more importance than any single channel's data. Applying a single filter to all three channels and getting one output (channel) should yield more meaningful characteristics, which can be utilized for image detection and other tasks.
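The point that one filter combines all three input channels into one output can be sketched with plain NumPy (the patch and filter values here are random placeholders, purely for illustration):

```python
import numpy as np

# A single 3x3 filter applied to one RGB patch: the filter has one
# 3x3 kernel per input channel, and the three channel responses are
# SUMMED into a single value -- R, G, B are combined, not kept apart.
patch = np.random.randn(3, 3, 3)   # (height, width, rgb)
filt = np.random.randn(3, 3, 3)    # one filter, same depth as input

out = np.sum(patch * filt)         # one scalar: one output channel

# With K filters you get K such scalars, i.e. K output channels --
# the number of output channels equals the number of filters.
filters = np.random.randn(4, 3, 3, 3)                      # 4 filters
outputs = np.array([np.sum(patch * f) for f in filters])   # shape (4,)
```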

Thank you very much for your response! However, this was not actually what I was looking for. I was wondering why we normalize across channels rather than across the other dimensions of the convolution output, namely the height and width.

This is actually an interesting question. There has been other research on choosing which group of axes to normalize over: Batch Norm, Layer Norm, Instance Norm, Group Norm, …
Here is one figure from the paper: Yuxin Wu, Kaiming He. “Group Normalization”

In some cases, the authors obtained better results with Group Normalization. If you are interested, please take a look at this interesting paper. Hope this helps.
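To make the figure's comparison concrete, here is a small NumPy sketch of which axes each scheme averages over (a channels-last layout and a group count of 4 are assumptions for illustration; only the means are shown, variances follow the same axes):

```python
import numpy as np

x = np.random.randn(4, 8, 8, 16)   # (N, H, W, C)

# Batch Norm: average over batch, height, width -> one value per channel.
bn_mu = x.mean(axis=(0, 1, 2), keepdims=True)   # shape (1, 1, 1, 16)

# Layer Norm: average over height, width, channels -> one per sample.
ln_mu = x.mean(axis=(1, 2, 3), keepdims=True)   # shape (4, 1, 1, 1)

# Instance Norm: average over height, width -> one per sample & channel.
in_mu = x.mean(axis=(1, 2), keepdims=True)      # shape (4, 1, 1, 16)

# Group Norm: split C into groups (here 4 groups of 4 channels) and
# average over height, width, and the channels within each group.
g = 4
xg = x.reshape(4, 8, 8, g, 16 // g)
gn_mu = xg.mean(axis=(1, 2, 4), keepdims=True)  # shape (4, 1, 1, 4, 1)
```

Instance Norm is exactly the "normalize each channel over height and width" scheme raised in the question, so that idea does exist in the literature; which scheme works best is an empirical matter, as the Group Normalization paper discusses.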