Batch Norm at Run Time

While watching “Batch Norm at Test Time,” a doubt crossed my mind: if small samples pose a problem at test time, how does this affect runtime performance? At run time we process one “sample” at a time and compute the average within that computation (please correct me if I’m mistaken). Could it be said that we are somewhat tailoring our approach to our test/dev samples when using this method? It’s akin to training a neural network that excels only when processing batches. Why not process individual samples during test time to get an experience that better reflects real-world conditions?

At test time, we can feed the largest batch possible to estimate model performance.

The issue with a test batch size that doesn’t match the training batch size is that \mu and \sigma^2 then need to be estimated robustly, for example by an exponentially weighted average over the training mini-batches, or by computing them over the entire training set.
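The exponentially weighted average mentioned above can be sketched as follows. This is a minimal illustration, not any framework’s exact implementation; the function name and the momentum value of 0.9 are assumptions for the example (frameworks typically use something in the 0.9–0.99 range).

```python
import numpy as np

def update_running_stats(z_batch, running_mean, running_var, momentum=0.9):
    """Update exponentially weighted averages of the batch statistics.

    z_batch: pre-activations for one mini-batch, shape (batch_size, n_units).
    momentum=0.9 is an illustrative choice, not a prescribed value.
    """
    batch_mean = z_batch.mean(axis=0)
    batch_var = z_batch.var(axis=0)
    # Blend the old running estimate with the current mini-batch statistic.
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    running_var = momentum * running_var + (1 - momentum) * batch_var
    return running_mean, running_var
```

Calling this once per mini-batch during training accumulates stable estimates of \mu and \sigma^2 that no longer depend on any single batch’s size.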

At test time:

  1. Directly compute z_{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} using the learnt \mu and \sigma^2 from the training data.
  2. Compute \tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta using the learnt scale and shift parameters \gamma and \beta.
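The two steps above can be sketched in NumPy. This is an illustrative single-example forward pass, assuming \mu, \sigma^2, \gamma, and \beta have already been learnt during training; the function name and \epsilon value are assumptions for the example.

```python
import numpy as np

def batchnorm_test_time(z, mu, sigma2, gamma, beta, eps=1e-8):
    """Apply batch norm to one test example using learnt statistics.

    mu, sigma2: running mean and variance accumulated during training.
    gamma, beta: learnt scale and shift parameters.
    """
    z_norm = (z - mu) / np.sqrt(sigma2 + eps)  # step 1: normalize
    return gamma * z_norm + beta               # step 2: scale and shift
```

Because \mu and \sigma^2 come from training rather than from the test batch, this works identically whether `z` is a single example or a full batch.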

Since learning these additional variables (\gamma, \beta, and the running estimates of \mu and \sigma^2) involves extra compute and memory, training takes somewhat longer. Inference, however, should be fast, since we simply plug in the learnt \mu and \sigma^2.