Using E.W.A. for Estimating Mean and Variance in B.N. at Test Time

In the Deep Learning Specialization, Course 2, Week 3, Section "Batch Normalization", Video "Batch Norm at Test Time", at 2:30 Prof. Andrew Ng says:

So just as we saw how to use a exponentially weighted average to compute the mean of Theta one, Theta two, Theta three when you were trying to compute a exponentially weighted average of the current temperature, you would do that to keep track of what’s the latest average value of this mean vector you’ve seen.

In the cases where we used E.W.A. to estimate some parameters, such as the Theta parameters that were measured temperatures on different days in London, or, more technically, when we estimated the amount by which to update parameters in gradient descent with Momentum by computing an E.W.A. of the gradients, there was a factor of time: we gave more importance to the latest data than to old data. We gave more weight to recent days' London temperatures than to older days', and more weight to gradients computed in recently taken steps than to gradients computed in older steps.
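For concreteness, here is a minimal sketch of that exponentially weighted average (my own illustrative numbers and variable names, not code from the course; bias correction omitted):

```python
# Exponentially weighted average (EWA) over a sequence of observations,
# e.g. daily temperatures. Illustrative values only.
thetas = [10.0, 12.0, 9.0, 11.0, 14.0]
beta = 0.9

v = 0.0
for theta in thetas:
    # Keep beta of the old average and add (1 - beta) of the newest value,
    # so recent observations carry more weight than older ones.
    v = beta * v + (1 - beta) * theta
print(v)
```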

But when it comes to mini-batches, we can't say that a specific mini-batch is more important than the others just because it is newer. Mini-batches consist of training examples, and these examples are distributed into the different mini-batches randomly, so no mini-batch should carry more importance than another. However, when we take the E.W.A. of the means and variances of different mini-batches, that is exactly what we do: we allocate more weight to some mini-batches than to others, and this unequal weight allocation is based on randomness!
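To make that unequal weighting concrete: after many updates with decay factor beta, the mean of the mini-batch seen k steps ago contributes with weight roughly (1 - beta) * beta^k. A small sketch (my own illustration, with an assumed beta of 0.9):

```python
# Approximate weights an EWA assigns to the last few mini-batch means,
# newest first (k = 0 is the most recent batch).
beta = 0.9
weights = [(1 - beta) * beta ** k for k in range(5)]
print(weights)  # roughly [0.1, 0.09, 0.081, 0.073, 0.066] -- newer batches weigh more
```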

  1. Can someone explain to me why we do such a thing?

My guess is:
It's true that E.W.A. randomly gives more weight to some mini-batches, but computing an E.W.A. is much faster than computing a simple mean over all mini-batches, so we sort of close our eyes to that random weight allocation in exchange for cheaper computation.

  2. I don’t quite understand the ‘vector’ that Prof. Andrew Ng is referring to in the sentence below (the last sentence of the paragraph I quoted above):

what’s the latest average value of this mean vector you’ve seen.

What are the elements of the mean vector, and how are they arranged?

But if the minibatches are randomly selected, then they should all have similar statistical behavior, so it doesn’t really matter that you’re giving more weight to the recent minibatches. If you listen again more carefully, you’ll hear Prof Ng say that this is just one of the possible ways to compute the batch norm parameters: he specifically says that you could also compute them as averages across the entire training set. Although if you’re running minibatch gradient descent, that would be a bit of a hassle to implement, which is the point of doing it the way he describes.

Note that everything is dynamic here: the output values depend both on the inputs and the (current) parameter values, since we’re looking at the pre-activation outputs of the inner layers of the network. So the values “evolve” over the training cycle, which seems like the key reason that using an EWA strategy is relevant: it gives you a way to automatically decay the influence of things that happened farther in the past as the training progresses.
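As an illustration of that “evolving” point, here is a rough sketch (my own, not the course’s code) of how the running mean and variance might be tracked with an EWA during training; beta and the variable names are assumptions:

```python
import numpy as np

def update_running_stats(z_batch, running_mean, running_var, beta=0.9):
    """One training-time EWA update of the estimated mean and variance.

    z_batch: array of shape (m, n) holding the pre-activation values of a
    layer with n hidden units for a mini-batch of m examples.
    """
    batch_mean = z_batch.mean(axis=0)   # one mean per hidden unit -> vector of n elements
    batch_var = z_batch.var(axis=0)     # one variance per hidden unit
    # Estimates from early batches (computed with poorly trained weights)
    # decay away as training progresses -- the point of using an EWA here.
    running_mean = beta * running_mean + (1 - beta) * batch_mean
    running_var = beta * running_var + (1 - beta) * batch_var
    return running_mean, running_var
```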

Of course, as implied above, all this computation is done at training time and the values are saved along with the other model parameters for use when the model is applied. The problem is that at test time we don’t want to be recomputing those parameters, both for efficiency and because we may have a very small number of input samples, or may even be making a prediction on a single input. So what would it mean to compute statistical values like $\mu$ and $\sigma$ for a single sample?
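At test time you then just plug the stored estimates into the batch norm formula, even for a single example. A minimal sketch under the same assumptions as the previous snippet (gamma and beta_param being the layer’s learned scale and shift):

```python
import numpy as np

def batchnorm_inference(z, running_mean, running_var, gamma, beta_param, eps=1e-8):
    # z can be a single example (shape (n,)) or a small batch (shape (m, n)).
    # No statistics are computed here; we reuse the EWA estimates from training.
    z_norm = (z - running_mean) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta_param
```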

The other important high-level point here is that no one but the actual algorithm designers of TensorFlow, PyTorch, etc. has to worry about this: the rest of us just invoke their APIs and all this magic happens “under the covers”.
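For example, in PyTorch a BatchNorm layer keeps these running estimates for you and switches between batch statistics and the stored running statistics depending on the mode; a quick illustration of the standard API (not code from the course):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)       # 4 features -> running_mean / running_var each have 4 elements
x = torch.randn(32, 4)

bn.train()                   # training mode: normalize with batch stats, update the running EWA
_ = bn(x)
print(bn.running_mean)       # updated via an EWA (PyTorch's default momentum is 0.1)

bn.eval()                    # eval mode: normalize with the stored running estimates
y = bn(torch.randn(1, 4))    # works even for a single example
```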
