In the Deep Learning Specialization, Course 2, Week 3, Section "Batch Normalization", Video "Batch Norm at Test Time", at 2:30 Prof. Andrew Ng says:
So just as we saw how to use a exponentially weighted average to compute the mean of Theta one, Theta two, Theta three when you were trying to compute a exponentially weighted average of the current temperature, you would do that to keep track of what’s the latest average value of this mean vector you’ve seen.
In the cases where we used an E.W.A. to estimate some parameters, such as the Theta values (measured temperatures on different days in London) or, more technically, the parameter updates in gradient descent with Momentum (computed as an E.W.A. of the gradients), time was a factor: we gave more weight to recent data than to old data. We gave more weight to recent days' London temperatures than to older days', and more weight to gradients from newly taken steps than to gradients from older steps.
But when it comes to mini-batches, we can't say one specific mini-batch is more important than another just because it is newer. Mini-batches consist of training examples, and those examples are distributed among the mini-batches randomly, so no mini-batch should carry more importance than any other. Yet when we take the E.W.A. of the means and variances of the different mini-batches, that is exactly what we do: we allocate more weight to some mini-batches than to others, and this unequal weight allocation is based on randomness!
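To make the weight allocation I'm describing concrete, here is a toy sketch (my own illustration, not from the course; beta = 0.9 and T = 10 are just example values):

```python
beta = 0.9   # E.W.A. decay factor (an example value, not from the lecture)
T = 10       # number of mini-batches seen so far

# Implicit weight the E.W.A. places on mini-batch t (1-indexed), no bias correction:
ewa_weights = [(1 - beta) * beta ** (T - t) for t in range(1, T + 1)]

# A simple mean would weight every mini-batch equally:
mean_weights = [1 / T] * T

print([round(w, 4) for w in ewa_weights])  # later mini-batches get larger weights
print(mean_weights)                        # every mini-batch weighted the same
```

The weights differ only by arrival order, even though which examples landed in which mini-batch was random.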
- Can someone explain why we do this?
My guess is:
It's true that an E.W.A. randomly gives more weight to some mini-batches, but taking an E.W.A. is much faster and cheaper than taking a simple mean over all mini-batches (we never have to store the per-mini-batch statistics), so we sort of close our eyes to that random weight allocation in exchange for computational speed.
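For comparison, here is a minimal sketch of what I mean (my own illustration, assuming NumPy, a made-up layer of 3 hidden units, and beta = 0.9): only one running mean vector and one running variance vector are ever stored, updated in a single pass over the mini-batches.

```python
import numpy as np

def ewa_update(running, batch_stat, beta=0.9):
    """One E.W.A. step: newer mini-batch statistics get more weight."""
    return beta * running + (1 - beta) * batch_stat

rng = np.random.default_rng(0)
running_mean = np.zeros(3)  # one entry per hidden unit (feature)
running_var = np.ones(3)

for _ in range(100):
    # Fake activations: 64 examples, 3 hidden units, true mean 2.0, std 1.5.
    minibatch = rng.normal(loc=2.0, scale=1.5, size=(64, 3))
    running_mean = ewa_update(running_mean, minibatch.mean(axis=0))
    running_var = ewa_update(running_var, minibatch.var(axis=0))

print(running_mean)  # approaches ~2.0 per unit
print(running_var)   # approaches ~2.25 per unit
```

Nothing is accumulated across mini-batches except these two fixed-size vectors, which is presumably part of the appeal.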
- I also don't quite understand the ‘vector’ that Prof. Andrew Ng is referring to in the sentence below (the last sentence of the paragraph I quoted above):
what’s the latest average value of this mean vector you’ve seen.
What are the elements of this mean vector, and how are they arranged in the vector?