We usually consider two aspects, i.e., space complexity and time complexity. For space complexity, EWMA only needs to keep two values (\beta and v_{t-1}) and receives one new data point (\theta_t) to calculate v_t. A sliding window needs to keep multiple data points (\theta_t, \theta_{t-1}, \theta_{t-2}, \theta_{t-3}, …) to calculate v_t. It is not a big difference, but from the space-complexity viewpoint, EWMA has a small advantage.
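Here is a minimal Python sketch (not from the course; the function names and the window size of 10 are my own illustrative choices) that shows the state each method has to keep:
```python
from collections import deque

def ewma_step(v_prev, theta_t, beta=0.9):
    """EWMA only needs beta, the previous value v_{t-1}, and the new theta_t."""
    return beta * v_prev + (1 - beta) * theta_t

def sliding_window_step(window, theta_t, window_size=10):
    """A sliding window must store the last `window_size` raw data points."""
    window.append(theta_t)
    if len(window) > window_size:
        window.popleft()
    return sum(window) / len(window)

v = 0.0
window = deque()
for theta in [10.0, 12.0, 11.0, 13.0, 9.0]:
    v = ewma_step(v, theta)                   # O(1) memory
    avg = sliding_window_step(window, theta)  # O(window_size) memory
```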
From the time-complexity viewpoint, I think both are similar. But, as you pointed out, if we include the bias-correction term, which involves \beta^{t}, then this could be a disadvantage. However, the bias correction only matters in the early phase: \beta^t goes to 0 quickly, so (1-\beta^t) approaches 1 and the correction can be neglected for most points. So we can say the time complexity is also similar.
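A tiny check of how quickly the bias term vanishes (\beta = 0.9 is just the typical value used in the lectures):
```python
# beta**t -> 0 quickly, so the correction denominator (1 - beta**t) -> 1.
beta = 0.9
for t in (1, 5, 10, 20, 50):
    print(t, beta ** t, 1 - beta ** t)
```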
I think the key point that Andrew wants to make is that the idea of EWMA is the basis for other optimization algorithms, which can provide better performance than simple gradient descent.
As you saw in this course, you learned:
- Gradient descent
- Momentum
- RMSprop
- Adam
For Momentum, Andrew said,
In one sentence, the basic idea is to compute an exponentially weighted average of your gradients, and then use that gradient to update your weights instead.
The actual equations (for W only) are:
V_{dw} = \beta V_{dw} + (1-\beta)dw
W = W - \alpha V_{dw}
Yes, EWMA is applied to gradients.
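A minimal sketch of that Momentum update in Python (the hyperparameter values are the usual defaults, and `W`, `V_dw`, `dw` can be scalars or numpy arrays; this is my own illustration, not the course code):
```python
def momentum_update(W, V_dw, dw, alpha=0.01, beta=0.9):
    V_dw = beta * V_dw + (1 - beta) * dw  # EWMA of the gradients
    W = W - alpha * V_dw                  # update with the smoothed gradient
    return W, V_dw
```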
As you can see, “Momentum” focuses on the gradient. On the other hand, RMSprop focuses on another important factor, the effective learning rate. The idea is to “change the effective learning rate depending on the gradient.”
S_{dw} = \beta S_{dw} + (1-\beta) dw^2
W = W - \alpha\frac{dw}{\sqrt{S_{dw}} + \epsilon}
Yes, EWMA is also a key part of these equations.
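And a minimal sketch of the RMSprop update (again my own illustration; \epsilon is the usual small constant to avoid division by zero):
```python
import numpy as np

def rmsprop_update(W, S_dw, dw, alpha=0.001, beta=0.9, epsilon=1e-8):
    S_dw = beta * S_dw + (1 - beta) * dw ** 2       # EWMA of squared gradients
    W = W - alpha * dw / (np.sqrt(S_dw) + epsilon)  # gradient scaled by its RMS
    return W, S_dw
```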
Adam is actually a mixture of “Momentum” and “RMSprop”. Andrew introduced bias correction here, which makes the equations look more complex. But here they are (again, W only for simplicity):
V_{dw} = \beta_1 V_{dw} + (1-\beta_1)dw\ \ \ \ \ \ (Momentum)
S_{dw} = \beta_2 S_{dw} + (1-\beta_2)dw^2\ \ \ \ (RMSprop)
V_{dw}^{corrected} = \frac{V_{dw}}{(1-\beta_1^t)}, \ \ \ S_{dw}^{corrected} = \frac{S_{dw}}{(1-\beta_2^t)}\ \ \ \ (Including bias correction)
W = W - \alpha\frac{V_{dw}^{corrected}}{\sqrt{S_{dw}^{corrected}} + \epsilon}
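Putting those four equations together, a minimal sketch of one Adam step (my own illustration; `t` is the 1-based iteration count used for bias correction, and the hyperparameters are the common defaults):
```python
import numpy as np

def adam_update(W, V_dw, S_dw, dw, t, alpha=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    V_dw = beta1 * V_dw + (1 - beta1) * dw       # Momentum part (EWMA of dw)
    S_dw = beta2 * S_dw + (1 - beta2) * dw ** 2  # RMSprop part (EWMA of dw^2)
    V_corr = V_dw / (1 - beta1 ** t)             # bias correction
    S_corr = S_dw / (1 - beta2 ** t)
    W = W - alpha * V_corr / (np.sqrt(S_corr) + epsilon)
    return W, V_dw, S_dw
```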
Again, I suppose Andrew's point is that the idea of EWMA is important for understanding the major optimization algorithms.