The professor said, “At test time you might not have a mini-batch of 64, 32 or 1024, so you need some different way of coming up with mu and sigma squared.”
He also said, “You use an exponentially weighted average to keep track of the values of sigma squared that you see on the first mini-batch in that layer, the sigma squared that you see on the second mini-batch, and so on.”

I understood that the reason we need a different way of coming up with mu and sigma squared at test time is that we might not have a mini-batch of 64, 32 or 1024.

My question 1. How and why would we use “an exponentially weighted average” over different mini-batches when we do have mini-batches? (I assume mini-batches exist at test time, as shown in the lecture.)

My question 2. It seems that we do have mini-batches at test time (as shown in the lecture). Then why can’t we calculate the mean and variance from the test examples in those mini-batches, just as we do at training time?

My question 3. If we have to use the exponentially weighted moving average for batch norm at test time, why don’t we use it at training time as well? I think it would do no harm and could improve the effect of batch norm.

We use an exponentially weighted average to approximate the true mean and standard deviation of the full training dataset. Each mini-batch seen during training contributes to these running estimates, which are then frozen and reused at test time.
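A minimal NumPy sketch of this idea, assuming a momentum hyperparameter `beta = 0.9` (the function and variable names here are illustrative, not from the lecture): the running mean and variance are updated once per mini-batch during training, and at test time even a single example can be normalized with the stored values.

```python
import numpy as np

def update_running_stats(batch, running_mean, running_var, beta=0.9):
    """Update exponentially weighted averages of the per-feature
    mean and variance using one mini-batch."""
    mu = batch.mean(axis=0)   # per-feature mean of this mini-batch
    var = batch.var(axis=0)   # per-feature variance of this mini-batch
    running_mean = beta * running_mean + (1 - beta) * mu
    running_var = beta * running_var + (1 - beta) * var
    return running_mean, running_var

# Simulate training: the running estimates converge toward the
# statistics of the full data distribution (mean 5, variance 4 here).
rng = np.random.default_rng(0)
running_mean, running_var = np.zeros(3), np.ones(3)
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))  # mini-batch of 64
    running_mean, running_var = update_running_stats(
        batch, running_mean, running_var)

# At test time, normalize with the stored averages, not the test batch --
# this works even for a "mini-batch" of one example.
x_test = rng.normal(loc=5.0, scale=2.0, size=(1, 3))
x_norm = (x_test - running_mean) / np.sqrt(running_var + 1e-5)
```

Because the update only ever sees one mini-batch at a time, it costs almost nothing extra during training, yet it yields much more stable statistics than any single test batch could.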

When you are predicting on the test set, you always use the training set’s statistics, whether for a simple transformation or for batch normalization. If the test statistics differ significantly from the training statistics, then the test data is simply different, and the model won’t work well; in that case you would need different training data anyway. To put it more precisely: when you train a model on data processed in a certain way, it won’t work well on data processed in a different way.
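The same principle applies to any preprocessing, not just batch norm. A small sketch with plain NumPy standardization (the variable names are illustrative): the mean and standard deviation are fitted on the training data only and then reused unchanged on the test data.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.normal(loc=10.0, scale=3.0, size=(1000, 2))
x_test = rng.normal(loc=10.0, scale=3.0, size=(5, 2))

# Fit the transformation on the training data only...
train_mean = x_train.mean(axis=0)
train_std = x_train.std(axis=0)

# ...then apply the *same* transformation to the test data.
# Recomputing mean/std on x_test would put train and test
# examples on different scales.
x_test_norm = (x_test - train_mean) / train_std
```

If you instead standardized the test set with its own statistics, the model would see inputs on a slightly different scale than it was trained on, which is exactly the mismatch the answer warns about.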

Interesting idea. It has actually been addressed by the authors of batch norm in their follow-up paper on Batch Renormalization: