Why do we use an exponentially weighted average to find the mean of **μ** and **σ²**? Why not use something like 1/T * (Sum of all **μ** / **σ²**)?

Wouldn’t the value taken from an exponentially weighted average lean towards the last few values of **μ** and **σ²**? If so, it’s not really the mean of **μ** and **σ²**, so why do we use it?
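
To make the two options concrete, here is a sketch in symbols. The index t, the number of mini-batches T, and the zero starting value v_0 are assumptions for illustration, and the same expressions apply to **σ²**:

$$
\bar{\mu} = \frac{1}{T}\sum_{t=1}^{T}\mu_t
\qquad\text{versus}\qquad
v_t = \beta\, v_{t-1} + (1-\beta)\,\mu_t, \quad v_0 = 0
$$

Unrolling the recursion gives

$$
v_T = (1-\beta)\sum_{t=1}^{T}\beta^{\,T-t}\,\mu_t
$$

so the most recent value gets weight (1-β) and each earlier one is scaled down by another factor of β, which is why the result leans towards the last few mini-batches.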

To explain what I mean, let’s say that in this graph we replace **temperature** with **μ** and **days** with **t (mini-batch number)**. We can see that if **μ** varies from batch to batch, the value we get from the exponentially weighted average depends on which mini-batch we stop at. This is different from taking the mean of **μ** and **σ²**.

An EWA takes less memory (in fact, a fixed amount of memory), and it can forget old values that might no longer be relevant since the weights have been updating.
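
As a rough illustration of the fixed-memory point, here is a minimal sketch in plain Python. The class name `RunningBatchStats`, the scalar (single-feature) setup, and the zero initial values are assumptions made for this example, not how any particular framework implements it:

```python
class RunningBatchStats:
    """Keeps EWAs of the per-batch mean and variance for one feature.

    Hypothetical helper for illustration: the state is two floats,
    no matter how many mini-batches have been seen.
    """

    def __init__(self, beta=0.9):
        self.beta = beta
        self.mu = 0.0    # EWA of batch means (assumed to start at 0)
        self.var = 0.0   # EWA of batch variances (assumed to start at 0)

    def update(self, batch):
        n = len(batch)
        batch_mu = sum(batch) / n
        batch_var = sum((x - batch_mu) ** 2 for x in batch) / n
        # Old information decays by a factor of beta on every update.
        self.mu = self.beta * self.mu + (1 - self.beta) * batch_mu
        self.var = self.beta * self.var + (1 - self.beta) * batch_var


stats = RunningBatchStats(beta=0.5)
stats.update([1.0, 3.0])   # batch mean 2.0, batch variance 1.0
stats.update([4.0, 4.0])   # batch mean 4.0, batch variance 0.0
print(stats.mu, stats.var)
```

Each call to `update` overwrites the two stored numbers, so nothing from earlier mini-batches has to be kept around.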

Cheers,

Raymond

In addition to Raymond’s point about memory usage, note that with EWAs this behavior is tunable: how much the values are weighted towards the most recent behavior is all controlled by the β value you select, right? So if the value you picked isn’t working very well, you need to consider adjusting it.
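
To get a feel for how β controls this, here is a small sketch. The helper names `ewa_weights` and `weight_on_recent` are made up for this example; they just evaluate the weight (1 - β)·β^k that an EWA puts on the value from k steps ago:

```python
def ewa_weights(beta, num_steps):
    """Weights an EWA puts on the last num_steps values, most recent first."""
    return [(1 - beta) * beta ** k for k in range(num_steps)]


def weight_on_recent(beta, num_recent):
    """Total weight on the num_recent most recent values.

    This is the closed form of sum(ewa_weights(beta, num_recent)).
    """
    return 1 - beta ** num_recent


for beta in (0.5, 0.9, 0.99):
    share = weight_on_recent(beta, 10)
    print(f"beta={beta}: last 10 mini-batches carry {share:.1%} of the weight")
```

A smaller β concentrates almost all of the weight on the last few mini-batches, while a β close to 1 spreads it over a much longer history.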

@paulinpaloalto @rmwkwok

Thank you both for answering. From what you have said, I’m assuming that taking the mean (1/T * (Sum of all **μ** / **σ²**)) is also something we can do, and that it would give a more accurate value to use at test time.

But because of memory efficiency, and because the difference between using the mean and the EWA value at test time is small and trivial, it is okay to use the EWA value. Is that right?

I still have my two concerns.

Also, it is hard to say for sure whether the difference is small. It might be small if we use a very small learning rate, so that training converges slowly and a sufficiently large number of samples fall in the stable range. However, we don’t use an unnecessarily small learning rate, because time is a resource we want to save.

It is also a question of how one would measure the accuracy. The accuracy is not precisely defined.

Your equation increases the contribution of the values from the first training step, where the weights that produce the values are randomly generated. Why would increasing that make things better?

I think we need experiments to get a feel for this. For example, we could check whether or not the distribution of the values shifts over training steps.
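
A toy version of such an experiment might look like the sketch below. The drift is entirely made up (a batch mean that decays smoothly from 5.0 to 1.0, standing in for a layer whose inputs settle as the weights train), so it only illustrates the effect and says nothing about a real network:

```python
import math


def simulated_batch_means(num_steps, start=5.0, final=1.0, tau=50.0):
    """Hypothetical drift: the batch mean decays from start towards final."""
    return [final + (start - final) * math.exp(-t / tau) for t in range(num_steps)]


def plain_mean(values):
    return sum(values) / len(values)


def ewa(values, beta=0.9):
    v = 0.0  # assumed zero start, no bias correction
    for x in values:
        v = beta * v + (1 - beta) * x
    return v


mus = simulated_batch_means(1000)
target = mus[-1]  # what the layer sees once training has settled

err_mean = abs(plain_mean(mus) - target)
err_ewa = abs(ewa(mus) - target)
print(f"plain mean error: {err_mean:.4f}")
print(f"EWA error:        {err_ewa:.4f}")
```

In this made-up setting the plain mean is pulled towards the early values, while the EWA ends up close to the settled value. Whether the same holds in a real model depends on how much the distribution actually shifts during training.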

Ahh right, I was missing this very crucial point. I was thinking that **μ** and **σ²** are not influenced by the weights, as they are calculated on the basis of input features. I was assuming that the input features do not change their value but what I just realized is that it’s not the case for the deep hidden layers. The input features **(z[l-1])** of deep hidden layers are subject to change.

So it makes sense now that we use EWAs, so that the values of **μ** and **σ²** used at test time are influenced mostly by the values from the last few iterations, where the input features of the hidden layers were more stable.

Thanks for making it clear to me.

I see. I am happy it gave you a different angle! It would indeed be a very interesting experiment, and if you decide to run it in the future, I think my angle can add something to monitor when you plan it.

Cheers,

Raymond
