I understand the value that the momentum term brings to an exponentially weighted average when information is added sequentially: the previously processed information as a whole is given more weight than the single latest piece being incorporated.
But when selecting a mean and variance to normalize the test examples, the random order of the mini-batches seen during training does not seem relevant. In that case the exponentially weighted average weights the batches unequally for no reason: the batches that randomly happened to come last dominate the estimate, while earlier batches are discounted exponentially.
Could a plain arithmetic average of the means and variances of all mini-batches be a better choice of statistics to use at test time (and in production)?
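To make the comparison concrete, here is a minimal NumPy sketch of the two aggregation schemes I have in mind. The momentum value beta = 0.9, the toy data, and the variable names are just illustrative assumptions, not anything prescribed by the course or a particular framework:

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 toy mini-batches drawn from the same distribution
batches = [rng.normal(loc=5.0, scale=2.0, size=64) for _ in range(100)]

beta = 0.9  # illustrative momentum value
ema_mean, ema_var = 0.0, 0.0
batch_means, batch_vars = [], []

for x in batches:
    m, v = x.mean(), x.var()
    batch_means.append(m)
    batch_vars.append(v)
    # Exponentially weighted (running) average, updated batch by batch during training
    ema_mean = beta * ema_mean + (1 - beta) * m
    ema_var = beta * ema_var + (1 - beta) * v

# Plain arithmetic average over all mini-batches (the alternative I am asking about)
avg_mean = np.mean(batch_means)
avg_var = np.mean(batch_vars)

print(f"EMA estimate:        mean={ema_mean:.3f}, var={ema_var:.3f}")
print(f"Arithmetic estimate: mean={avg_mean:.3f}, var={avg_var:.3f}")
```

Since the toy mini-batches here are i.i.d., both estimates land close to the true statistics; my question is whether the uniform average is preferable in principle when the batch ordering carries no information.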