Batch normalization tries to reduce the problem of covariate shift, but can this lead to bias? I mean, suppose the model has been trained on black cats and we test it on white cats. With batch normalization, the distribution of white cats will be different from the distribution of black cats (as I understand it), but batch normalization will reduce these differences in the hidden layers. So if I give the network a white or black dog, batch normalization will try to reduce the difference in the hidden layers, and that will lead to a wrong prediction.

Hey @Ibrahim_Mustafa,

The aim of batch normalization is to **reduce the shift between the distributions of samples across batches**, not to reduce the shift between the distributions of samples belonging to different classes. That is why batch normalization has **trainable parameters** (a scale and a shift): with them, samples across different batches end up with almost the same distribution, while samples belonging to different classes still have differing distributions. Those per-class distributions may differ from the original ones, since we are now using new statistics, i.e., the mean and the variance. Let us know if this resolves your query.

Cheers,

Elemento

I am sorry, but I didn't understand. Could you please explain with an example?

Hey @Ibrahim_Mustafa,

Sure, let me use an example, in which we will be considering samples belonging to 2 mini-batches and 3 different classes. Note that both the mini-batches have samples of all the 3 different classes (*though the number of samples of each class in each mini-batch might be different*).

Let's say that the statistics, in the form of `(mean, variance)`, are `(2.01, 1.98)` for the 1st mini-batch and `(3.01, 0.91)` for the 2nd mini-batch. The difference in statistics could be due to various reasons; for instance, the sources of the majority of samples in the 2 mini-batches might be different. Now, with the help of **Batch Normalization**, we might be able to bring these distributions closer to each other, for instance `(0.51, 1.02)` and `(0.53, 1.05)`. In the absence of trainable parameters, both the distributions would be of the form `(0, 1)`.
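To make the numbers above a bit more concrete, here is a minimal NumPy sketch, not a definitive implementation. It assumes a single feature, synthetic Gaussian mini-batches drawn with the `(mean, variance)` values from the example, and hypothetical learned values for the scale `gamma` and shift `beta`. The helper `batch_norm` is written here for illustration only. In this simplified training-mode version each batch is normalized with its own statistics, so the two outputs match exactly; with running statistics at inference time the match would only be approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize x with its own batch statistics, then scale and shift."""
    mean = x.mean()
    var = x.var()
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Synthetic mini-batches with the (mean, variance) values from the example.
batch_1 = rng.normal(loc=2.01, scale=np.sqrt(1.98), size=256)
batch_2 = rng.normal(loc=3.01, scale=np.sqrt(0.91), size=256)

# Without trainable parameters: both batches end up close to (0, 1).
plain_1 = batch_norm(batch_1)
plain_2 = batch_norm(batch_2)

# With trainable parameters (values below are hypothetical, chosen by hand):
# both batches end up close to (beta, gamma ** 2).
gamma, beta = 1.0, 0.5
scaled_1 = batch_norm(batch_1, gamma, beta)
scaled_2 = batch_norm(batch_2, gamma, beta)

for name, out in [("plain_1", plain_1), ("plain_2", plain_2),
                  ("scaled_1", scaled_1), ("scaled_2", scaled_2)]:
    print(f"{name}: mean={out.mean():.3f}, variance={out.var():.3f}")
```

Printing the statistics before and after shows the two mini-batches starting far apart and ending with matching statistics, which is the only thing batch normalization is aiming for here.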

Now, the thing to note here is that in this entire example, we haven't talked about the distributions of the different classes, because we don't need to. When we scale the distribution of an entire mini-batch, the distributions of the individual classes are scaled automatically. Earlier, the distributions of, say, samples belonging to classes "A" and "B" could be `(1.52, 1.21)` and `(2.61, 2.93)`, and now, after batch normalization, the distributions could be `(0.37, 0.32)` and `(0.89, 1.01)`.

As we can see, the distributions belonging to different classes are still different from each other, though they have been scaled accordingly. I hope this example resolves your query.
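The same point can be checked with a small NumPy sketch. It assumes one feature, a single mini-batch with equal numbers of synthetic Gaussian samples from classes "A" and "B" (drawn with the `(mean, variance)` values used above), and no trainable parameters. The variable names are made up for illustration, and the resulting numbers will not match the illustrative ones in the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic samples for two classes, using the example's (mean, variance).
class_a = rng.normal(loc=1.52, scale=np.sqrt(1.21), size=500)
class_b = rng.normal(loc=2.61, scale=np.sqrt(2.93), size=500)

batch = np.concatenate([class_a, class_b])
labels = np.array([0] * 500 + [1] * 500)  # 0 = class A, 1 = class B

# Normalize using the statistics of the whole mini-batch,
# without looking at the class labels at all.
eps = 1e-5
normalized = (batch - batch.mean()) / np.sqrt(batch.var() + eps)

norm_a = normalized[labels == 0]
norm_b = normalized[labels == 1]

print(f"whole batch: mean={normalized.mean():.3f}, variance={normalized.var():.3f}")
print(f"class A:     mean={norm_a.mean():.3f}, variance={norm_a.var():.3f}")
print(f"class B:     mean={norm_b.mean():.3f}, variance={norm_b.var():.3f}")
```

The whole batch ends up with mean 0 and variance close to 1, but every sample goes through the same shift and scale, so class A still sits below class B after normalization.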

Cheers,

Elemento