Hi! I am wondering, for exponentially weighted averages, why can't we initialize v_0 = \theta_1 instead of 0, so that the curve passes through the first point even without bias correction? Thanks for any help!


Hi, @xuanghdu0.

Initializing with v_0 = 0 is equivalent to having seen an infinite sequence of 0s before the data started; initializing with v_0 = \theta_1 is equivalent to having seen an infinite sequence of \theta_1 values. Neither history actually happened.

Doesn’t explicitly accounting for the bias introduced by the zero initialization make more intuitive sense? (I could be wrong.)
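If it helps, here is a minimal sketch in plain Python (the beta value and theta sequence are made up for illustration) comparing the three options: 0-initialization without correction, 0-initialization with bias correction, and your proposed v_0 = \theta_1:

```python
# Three ways of handling EWA initialization, side by side.
beta = 0.9
thetas = [10.0, 10.2, 9.8, 10.1, 10.0]  # hypothetical observations

v_zero = 0.0         # v_0 = 0, no correction
v_first = thetas[0]  # v_0 = theta_1 (the proposal in the question)
for t, theta in enumerate(thetas, start=1):
    v_zero = beta * v_zero + (1 - beta) * theta
    v_first = beta * v_first + (1 - beta) * theta
    v_corrected = v_zero / (1 - beta ** t)  # bias-corrected estimate
    print(f"t={t}: zero-init={v_zero:.3f}, "
          f"corrected={v_corrected:.3f}, first-init={v_first:.3f}")
```

Both the corrected curve and the \theta_1-initialized curve start at \theta_1, but they diverge afterwards: the corrected estimate reweights only the observations actually seen, while the \theta_1-initialized one keeps the fictitious infinite history of \theta_1 values baked in.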

```
All exponential moving averages initialized with Tensors are initialized to 0,
and therefore are biased to 0. Variables initialized to 0 and used as EMAs are
similarly biased. This function creates the debias updated amount according to
a scale factor, as in (Kingma et al., 2015).

To demonstrate the bias that results from 0-initialization, take an EMA that
was initialized to `0` with decay `b`. After `t` timesteps of seeing the
constant `c`, the variable has the following value:

    EMA = 0*b^(t) + c*(1 - b)*b^(t-1) + c*(1 - b)*b^(t-2) + ...
        = c*(1 - b^t)

To have the true value `c`, we would divide by the scale factor `1 - b^t`.
```

(source)
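A quick numeric sanity check of that closed form, in plain Python (the values of `b`, `c`, and the step count are arbitrary):

```python
# After t steps of seeing the constant c, a 0-initialized EMA with
# decay b should equal c * (1 - b**t), and dividing by that scale
# factor should recover c exactly.
b, c = 0.9, 5.0
ema = 0.0
for t in range(1, 11):
    ema = b * ema + (1 - b) * c
    assert abs(ema - c * (1 - b ** t)) < 1e-12
    assert abs(ema / (1 - b ** t) - c) < 1e-12  # debiased value == c
print("closed form and debiasing check out")
```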

Take a look at section 3 of the cited paper (Kingma et al., 2015) if you’re interested in the complete derivation.
