Advantage of Exponentially Weighted Average (EWMA) over Sliding Window

When computing EWMA with bias correction, isn’t the computation more costly than for Sliding Window Averages with bias correction? If I understood Andrew correctly, a Sliding Windows Average would also yield better results for the optimization algorithm.

So when you do use bias correction, the only advantage of EWMA is roughly (1/(1-\beta)) \cdot |\Theta| less memory!? That memory requirement does not sound so bad – why are there no alternative optimization algorithms using a sliding window, especially for Adam?

We usually consider two aspects, i.e., space complexity and time complexity. For space complexity, EWMA keeps only two values (\beta and v_{t-1}) and receives one new value (\theta_t) to calculate v_t. A sliding window uses multiple data points (\theta_t, \theta_{t-1}, \theta_{t-2}, \theta_{t-3}, …) to calculate v_t. It is not a big difference, but from the space-complexity viewpoint, EWMA has a small advantage.
From the time-complexity viewpoint, I think both are similar. But, as you pointed out, if we consider the bias-correction term, which includes \beta^{t}, then this can be a disadvantage. However, this term only matters in the early phase: \beta^t goes to 0 quickly, so (1-\beta^t) approaches 1 and the correction can be neglected at most points. So we may say the time complexity is also similar.
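The difference can be sketched in a few lines of Python (a toy 1-D example; the function names and the window size k=10 are my own, not from the lecture). EWMA carries one scalar of state per parameter, while the sliding window must store the last k values:

```python
from collections import deque

def ewma_stream(values, beta=0.9, bias_correct=True):
    """Exponentially weighted average: one scalar of state (v) per parameter."""
    v = 0.0
    out = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t) if bias_correct else v)
    return out

def sliding_window_stream(values, k=10):
    """Sliding-window average: stores the last k values per parameter."""
    window = deque(maxlen=k)
    out = []
    for x in values:
        window.append(x)
        out.append(sum(window) / len(window))
    return out

data = [10.0] * 20
print(ewma_stream(data)[-1])            # ~10.0 after bias correction
print(sliding_window_stream(data)[-1])  # 10.0
```

Per update, EWMA does a constant amount of work regardless of \beta, while the naive window sum above is O(k) (it can be made O(1) with a running sum, but the O(k) memory remains).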

I think the key point that Andrew wants to make is that the idea of EWMA is the basis for other optimization algorithms, which can provide better performance than simple gradient descent.
As you saw in this course, you learned:

  • Gradient descent
  • Momentum
  • RMSprop
  • Adam

For Momentum, Andrew said,

In one sentence, the basic idea is to compute an exponentially weighted average of your gradients, and then use that gradient to update your weights instead.

The actual equations (for W only) are:

V_{dw} = \beta V_{dw} + (1-\beta)dw
W = W - \alpha V_{dw}

Yes, EWMA is applied to gradients.
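A minimal momentum step matching the equations above might look like this (shown for scalars for simplicity; in practice W and dW are arrays, and the \alpha and \beta values here are just illustrative defaults):

```python
def momentum_step(W, dW, v_dW, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update (W only)."""
    v_dW = beta * v_dW + (1 - beta) * dW  # EWMA of the gradients
    W = W - alpha * v_dW                  # update with the averaged gradient
    return W, v_dW
```

Note that v_{dW} is carried across iterations, which is exactly the O(1)-per-parameter state that EWMA needs.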
As you see, “Momentum” focuses on the gradient. On the other hand, RMSprop focuses on another important factor, the learning rate. The idea is “change the effective learning rate depending on the gradient.”

S_{dw} = \beta S_{dw} + (1-\beta) dw^2
W = W - \alpha\frac{dw}{\sqrt{S_{dw}}+\epsilon}

Yes, EWMA is also a key part of these equations.
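The corresponding RMSprop step, again as a scalar sketch with illustrative defaults, and with \epsilon added outside the square root for numerical stability (the common convention):

```python
import math

def rmsprop_step(W, dW, s_dW, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update (W only)."""
    s_dW = beta * s_dW + (1 - beta) * dW ** 2      # EWMA of squared gradients
    W = W - alpha * dW / (math.sqrt(s_dW) + eps)   # per-parameter effective LR
    return W, s_dW
```

Dividing by \sqrt{S_{dw}} shrinks the step where gradients have been large and enlarges it where they have been small, which is what “adapting the learning rate” means here.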

Adam is actually a mixture of “Momentum” and “RMSprop”. Andrew introduced bias correction here, which makes the equations a bit more complex. But here it is (again W only, for simplicity):

V_{dw} = \beta_1 V_{dw} + (1-\beta_1)dw\ \ \ \ \ \ (Momentum)
S_{dw} = \beta_2 S_{dw} + (1-\beta_2)dw^2\ \ \ \ (RMSprop)
V_{dw}^{corrected} = \frac{V_{dw}}{(1-\beta_1^t)}, \ \ \ S_{dw}^{corrected} = \frac{S_{dw}}{(1-\beta_2^t)}\ \ \ \ (Including bias correction)
W = W - \alpha\frac{V_{dw}^{corrected}}{\sqrt{S_{dw}^{corrected}}+\epsilon}
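Putting the four equations together, one Adam step can be sketched as follows (scalar form; the hyperparameter defaults are the commonly used ones, and t is the 1-based iteration count needed for bias correction):

```python
import math

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (W only), combining Momentum and RMSprop."""
    v = beta1 * v + (1 - beta1) * dW        # Momentum: EWMA of gradients
    s = beta2 * s + (1 - beta2) * dW ** 2   # RMSprop: EWMA of squared gradients
    v_hat = v / (1 - beta1 ** t)            # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (math.sqrt(s_hat) + eps)
    return W, v, s
```

You can see both EWMA recurrences, plus the (1-\beta^t) correction that matters only while t is small.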

Again, I suppose what Andrew wanted to convey is that the idea of EWMA is important for understanding the major optimizer algorithms.

Hi Nobu,

thanks a lot for your answer. You give a great overview of the lecture and summarize that both time and space complexity are similar between EWMA with bias correction and sliding-window averages with bias correction.

But I think Andrew mentioned that the sliding window yields better results. Right?

Is this the statement you are referring to?

If you were to compute a moving window, where you explicitly sum over the last 10 days,
the last 50 days temperature and just divide by 10 or divide by 50, that usually gives you a better estimate.

I think this is a comparison within the “moving (sliding) window” method with different window sizes; it is not a comparison to EWMA. He just wanted to say that to get a better result with a “moving window”, the window size needs to be larger. But this has a disadvantage in space complexity.

So, the answer to your question is: he did not mention that. The answer is no, I think.

By the way, personally, I find Andrew’s statement above slightly questionable.
A “sliding window” is rather simple, and can easily pick up outliers or obsolete data. So I suppose there should be a best window size depending on the subject.
On the other hand, in the case of EWMA, the most recent data point has the largest weight. I think this is quite reasonable.
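This weighting is easy to see by unrolling the recurrence: the point k steps in the past receives weight (1-\beta)\beta^k, which decays geometrically. A quick check:

```python
beta = 0.9
# Weight the EWMA assigns to the data point k steps in the past,
# obtained by unrolling v_t = beta * v_{t-1} + (1 - beta) * x_t.
weights = [(1 - beta) * beta ** k for k in range(5)]
print(weights)  # the most recent point (k = 0) gets the largest weight
```

So EWMA never drops old data abruptly the way a window does; it just fades it out, with \beta controlling how fast.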

In any case, from the ML viewpoint, both the window size and \beta may be trainable variables, though… :smiling_face: