Hi, in Week 3, Andrew mentions that we normalize each input feature (or deep feature) across the examples of the mini-batch. I understand how that helps when we have batch_size = 64 or more, because the batch's mean and variance would generally be similar to those of the entire training set, with some noise.
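To check my understanding of that part, here is a quick numpy sketch (the numbers and array names are just made up for illustration):

```python
import numpy as np

np.random.seed(0)
# pretend this is one input feature over the whole training set, roughly N(5, 2^2)
x_feature = np.random.randn(10_000) * 2.0 + 5.0

# a random mini-batch of 64 examples
batch = np.random.choice(x_feature, size=64, replace=False)

print(x_feature.mean(), x_feature.var())  # ~5.0, ~4.0 over the full training set
print(batch.mean(), batch.var())          # close to the above, with some sampling noise
```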
However, as our mini-batch size approaches one, doesn't this make the normalization too noisy? If we have batch_size = 4, for example, the batch's feature distribution is much more likely to be quite different from the training set distribution, and if batch_size = 1 (the SGD case), z_norm will always be a vector of zeros if we apply the same formula!
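Just to be concrete about the batch_size = 1 case, here is how I'm picturing the normalization step (the function name and epsilon value are mine, not from the lecture):

```python
import numpy as np

def batchnorm_forward(Z, eps=1e-8):
    """Normalize each feature (row) across the examples (columns) of the mini-batch."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

Z_single = np.array([[3.7], [-1.2], [0.5]])  # 3 features, batch_size = 1
print(batchnorm_forward(Z_single))           # all zeros: each feature equals its own "batch mean"
```

So if I'm reading the formula right, z_tilde = gamma * z_norm + beta would then collapse to beta for every example, regardless of the input.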
Also, at test time, why do we use exponentially weighted averages to estimate the overall (mean, variance)? Doesn't this mean that earlier batches contribute less to the final (mu_test, sigma_test) values? If the training set was not properly shuffled before creating the mini-batches, (mu_t, sigma_t) will be quite different across t. In that case, isn't this a problem, since the (mu_test, sigma_test) calculation will depend mainly on the latest batches (roughly t = T-9, …, T for beta = 0.9) while the earlier ones (t = 1, 2, …) are mostly ignored?
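To make this second question concrete, here is how I'm picturing the weight that the running average ends up giving each batch's (mu_t, sigma_t) (beta = 0.9 and T = 100 are just example numbers I picked):

```python
import numpy as np

beta = 0.9
T = 100  # total number of mini-batches processed

# mu_running = beta * mu_running + (1 - beta) * mu_t at each step, so after T steps
# batch t's statistics carry a weight of (1 - beta) * beta ** (T - t) in mu_test
weights = np.array([(1 - beta) * beta ** (T - t) for t in range(1, T + 1)])

print(weights[-10:].sum())  # ~0.65: the last 10 batches dominate
print(weights[:50].sum())   # ~0.005: the first 50 batches contribute almost nothing
```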