An ambiguity about batch normalization at test time

Hi everybody,
I would appreciate it if you could confirm my understanding of batch normalization at test time. As I understand it, we keep track of mu and sigma squared for each mini-batch. Then, we use them to calculate Z_normalized and Z_tilde. For instance, if we have N mini-batches, we will compute Z_normalized and Z_tilde N times for layer L. At the end, we have a vector of Z_tilde values for the various mini-batches. Am I right?
Best Regards,

Hey @S.hejazinezhad,
There are some very small gaps in your understanding. Let me try to fill those. Just to make sure that we are following the same underlying reference, I will be using the C2 W3 lectures.

Allow me to start from training time, since “training” and “test” times are inter-linked. There are 4 important elements in Batch Normalization: \gamma, \beta, \mu and \sigma^2. Now, \gamma and \beta are learnable parameters, so they behave like any other “weight” during the training and test times, i.e., they get updated via back-propagation during training, and are fixed during testing. So, let’s keep them aside. Additionally, z_{norm}^{(i)} and \tilde{z}^{(i)} can easily be determined once we have these 4 elements, using the equations described in the lecture, so I am keeping them aside as well.
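To make the roles of these 4 elements concrete, here is a minimal numpy sketch of the forward pass during training. The function and variable names are my own for illustration; `eps` is the small epsilon added for numerical stability, as in the lecture equations:

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-8):
    """One Batch Norm forward pass on a mini-batch.

    z: (batch_size, n_units) pre-activations for one layer.
    gamma, beta: learnable scale/shift, one value per unit.
    """
    # mu and sigma^2 are computed per unit, over the mini-batch
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    # z_norm has mean ~0 and variance ~1 per unit
    z_norm = (z - mu) / np.sqrt(var + eps)
    # z_tilde lets the network learn its own mean/variance via gamma, beta
    z_tilde = gamma * z_norm + beta
    return z_tilde, mu, var
```

Note that `mu` and `var` are returned as well, because (as discussed below) training also needs to accumulate them for later use at test time.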

Now, the only elements remaining are \mu and \sigma^2. During training, we compute them for each of the mini-batches, and use them to determine the z(s). However, we can’t do the same for testing, for 2 reasons:

  • First, during testing, we may not have well-defined batches of inputs.
  • Second, which is more important, is that the aim of BN is to normalize the test inputs so that they resemble the training inputs more closely. But if we compute the distribution statistics using the test-set samples only, don’t you think it would simply defeat the purpose of using BN in the first place?

So, for testing, what we do is compute running/moving averages of \mu and \sigma^2 across the mini-batches during training, and then use those estimates at test time.
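A common way to do this is with an exponentially weighted average, the same idea as in the optimization lectures. The sketch below assumes a momentum of 0.9; the names are mine for illustration:

```python
import numpy as np

def update_running_stats(running_mu, running_var, batch_mu, batch_var,
                         momentum=0.9):
    # Exponentially weighted average over mini-batch statistics,
    # updated once per mini-batch during training.
    running_mu = momentum * running_mu + (1 - momentum) * batch_mu
    running_var = momentum * running_var + (1 - momentum) * batch_var
    return running_mu, running_var

def batchnorm_test_time(z, gamma, beta, running_mu, running_var, eps=1e-8):
    # At test time: no batch statistics are computed; we reuse the
    # running estimates accumulated during training.
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta
```

This way a single test example can be normalized on its own, and the statistics it is normalized with reflect the training distribution rather than the test set.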

Let me know if this helps.