Hi, in Week 3, Andrew mentions that we normalize each input feature (or deep feature) across the examples of the mini-batch. I understand how that helps when we have batch_size = 64 or more, because the batch's mean and variance would generally be similar to those of the entire training set, with some noise.
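To check my understanding of that part, here is a quick numpy sketch (the numbers and array names are just made up for illustration):

```python
import numpy as np

np.random.seed(0)
# pretend this is one input feature over the whole training set, roughly N(5, 2^2)
x_feature = np.random.randn(10_000) * 2.0 + 5.0

# a random mini-batch of 64 examples
batch = np.random.choice(x_feature, size=64, replace=False)

print(x_feature.mean(), x_feature.var())  # ~5.0, ~4.0 over the full training set
print(batch.mean(), batch.var())          # close to the above, with some sampling noise
```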
However, as our mini-batch size approaches one, doesn't this make the normalization too noisy? If we have batch_size = 4, for example, the batch's feature distribution is much more likely to be quite different from the training set distribution, and if batch_size = 1 (the SGD case), z_norm will always be a vector of zeros if we apply the same formula!
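Just to be concrete about the batch_size = 1 case, here is how I'm picturing the normalization step (the function name and epsilon value are mine, not from the lecture):

```python
import numpy as np

def batchnorm_forward(Z, eps=1e-8):
    """Normalize each feature (row) across the examples (columns) of the mini-batch."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

Z_single = np.array([[3.7], [-1.2], [0.5]])  # 3 features, batch_size = 1
print(batchnorm_forward(Z_single))           # all zeros: each feature equals its own "batch mean"
```

So if I'm reading the formula right, z_tilde = gamma * z_norm + beta would then collapse to beta for every example, regardless of the input.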
Also, at test time, why do we use exponentially weighted averages to estimate the overall (mean, variance)? Doesn't this mean that earlier batches contribute less to the final (mu_test, sigma_test) values? If the training set was not properly shuffled before creating the mini-batches, (mu_t, sigma_t) will be quite different across t. In that case, isn't this a problem, since the (mu_test, sigma_test) calculation will depend mainly on the latest batches (roughly t = T-9, …, T for beta = 0.9) while the earlier ones (t = 1, 2, …) are mostly ignored?
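To make this second question concrete, here is how I'm picturing the weight that the running average ends up giving each batch's (mu_t, sigma_t) (beta = 0.9 and T = 100 are just example numbers I picked):

```python
import numpy as np

beta = 0.9
T = 100  # total number of mini-batches processed

# mu_running = beta * mu_running + (1 - beta) * mu_t at each step, so after T steps
# batch t's statistics carry a weight of (1 - beta) * beta ** (T - t) in mu_test
weights = np.array([(1 - beta) * beta ** (T - t) for t in range(1, T + 1)])

print(weights[-10:].sum())  # ~0.65: the last 10 batches dominate
print(weights[:50].sum())   # ~0.005: the first 50 batches contribute almost nothing
```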