Hi! For exponentially weighted averages, I am wondering why we can't initialize v_0 = \theta_1 instead of 0, so that the curve passes through the first point even without bias correction. Thanks for any help!
Hi, @xuanghdu0.
Initializing to 0 is equivalent to having seen an infinite sequence of 0s; initializing to \theta_1 is equivalent to having seen an infinite sequence of \theta_1s.
Doesn't explicitly accounting for the bias that's introduced make more intuitive sense? (I could be wrong.)
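Here's a quick numerical sketch of that point (plain NumPy, with a made-up decay \beta = 0.9 and a toy sequence where \theta_1 happens to be an outlier). Initializing v_0 = \theta_1 does make the curve pass through the first point, but the average then behaves as if \theta_1 had an infinite history behind it, so it stays anchored to \theta_1 much longer than the bias-corrected version does:
```python
import numpy as np

beta = 0.9
theta = np.array([1.0, 10.0, 10.0, 10.0, 10.0])  # theta_1 is an outlier in this toy data

def ewa(theta, beta, v0):
    """Plain exponentially weighted average: v_t = beta*v_{t-1} + (1 - beta)*theta_t."""
    v, out = v0, []
    for x in theta:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return np.array(out)

t = np.arange(1, len(theta) + 1)
v_zero      = ewa(theta, beta, v0=0.0)        # v_0 = 0, no correction
v_theta1    = ewa(theta, beta, v0=theta[0])   # v_0 = theta_1
v_corrected = v_zero / (1 - beta ** t)        # v_0 = 0 with bias correction

print(v_zero)       # roughly [0.1, 1.09, 1.98, 2.78, 3.50] -> dragged toward 0 early on
print(v_theta1)     # roughly [1.0, 1.90, 2.71, 3.44, 4.10] -> passes through theta_1 but stays near it
print(v_corrected)  # roughly [1.0, 5.74, 7.31, 8.09, 8.56] -> passes through theta_1 and adapts faster
```
Both v_theta1 and v_corrected agree at the first point, but after that the corrected average forgets the initialization, while the \theta_1-initialized one keeps treating \theta_1 as an infinitely long history.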
All exponential moving averages initialized with Tensors are initialized to 0,
and therefore are biased to 0. Variables initialized to 0 and used as EMAs are
similarly biased. This function creates the debiased update amount according to
a scale factor, as in (Kingma et al., 2015).
To demonstrate the bias that results from 0-initialization, take an EMA that
was initialized to `0` with decay `b`. After `t` timesteps of seeing the
constant `c`, the variable has the following value:
```
EMA = 0*b^(t) + c*(1 - b)*b^(t-1) + c*(1 - b)*b^(t-2) + ...
    = c*(1 - b^t)
```
To have the true value `c`, we would divide by the scale factor `1 - b^t`.
(source)
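As a quick sanity check of that closed form (plain Python, made-up values b = 0.9 and c = 5.0, not the library code), the loop below runs the recurrence from a zero initialization and confirms that dividing by `1 - b^t` recovers `c` at every step:
```python
b, c, T = 0.9, 5.0, 10

ema = 0.0
for t in range(1, T + 1):
    ema = b * ema + (1 - b) * c                   # EMA recurrence with 0-initialization
    assert abs(ema - c * (1 - b ** t)) < 1e-9     # matches the closed form c*(1 - b^t)
    print(t, round(ema, 4), ema / (1 - b ** t))   # last column is the debiased value, ~c at every step
```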
Take a look at section 3 of the cited paper if you’re interested in the complete derivation.