Different Mean/Standard Deviation values for hidden units

In batch normalization, Andrew said we may not want all hidden units to have mean 0 and standard deviation 1. If I guess correctly, it’s to "break symmetry".

And if we really wanted a larger variance to take advantage of the sigmoid’s non-linearity, then why did we even normalize in the first place?

Hi @Muhammad_Bin_Usman,

welcome to the community and thanks for your question!

Batch normalization can help accelerate training by keeping the distribution of layer inputs consistent across batches. It does this by tackling internal covariate shift (a systematic change in the distribution of network activations during training). This is well outlined in this article: Internal Covariate Shift: How Batch Normalization can speed up Neural Network Training | by Jamie Dowat | Analytics Vidhya | Medium, and in this paper: [1502.03167] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Can you elaborate a bit more on what you mean here specifically? In general, batch normalisation is about keeping layer activations consistent so that successive layers fit well together, especially since the weights in each layer change during training and the batches of training data differ. The aim is to make gradient flow efficient and keep gradients stable (e.g. the risk of vanishing gradients is reduced); see also this thread: Vanishing/Exploding Gradients when there is a non-linear activation function - #3 by Christian_Simonis!
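To make the "different mean and standard deviation per hidden unit" part concrete, here is a minimal NumPy sketch of the batch norm forward pass. It assumes the learnable scale and shift parameters from the paper linked above, called `gamma` and `beta` here. The function name `batch_norm`, the batch shape and the parameter values are all made up for illustration, and a real layer would also track running statistics for inference:

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Normalize each hidden unit over the batch, then rescale and shift.

    z has shape (batch_size, n_units); gamma and beta have shape (n_units,).
    """
    mu = z.mean(axis=0)                      # per-unit batch mean
    var = z.var(axis=0)                      # per-unit batch variance
    z_norm = (z - mu) / np.sqrt(var + eps)   # mean 0, std ~1 per unit
    return gamma * z_norm + beta             # std ~gamma, mean beta per unit

rng = np.random.default_rng(0)
# Hypothetical pre-activations: 256 examples, 4 hidden units
z = rng.normal(loc=3.0, scale=5.0, size=(256, 4))

# Illustrative values only; in training these are learned by gradient descent
gamma = np.array([1.0, 2.0, 0.5, 3.0])
beta = np.array([0.0, 1.0, -1.0, 2.0])

z_tilde = batch_norm(z, gamma, beta)
out_mean = z_tilde.mean(axis=0)   # approximately equal to beta
out_std = z_tilde.std(axis=0)     # approximately equal to gamma
```

So the normalization step removes the dependence on whatever statistics the previous layers happen to produce, and `gamma` and `beta` then let each unit settle on its own mean and spread. With `gamma = 1` and `beta = 0` you get back the plain mean-0, standard-deviation-1 case.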

Hope that helps.

Best regards