Why is the following formula used for the exponentially weighted average?

v_t = β × v_(t−1) + (1 − β) × θ_t

And in the second week's programming assignment, gradient descent with momentum uses this update:

v_dW[l] = β × v_dW[l] + (1 − β) × dW[l]

Why does the first term, β × v_dW[l], in the momentum formula use l rather than l−1? And if every v_dW is initialized to 0, wouldn't the first term be meaningless?

Oh, I misunderstood. In gradient descent with momentum, [l] refers to the l-th parameter, not the l-th iteration.

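Yes, the l (lowercase ell) there is the layer number; it has nothing to do with iterations. The iteration index is implicit: the v_dW[l] on the right-hand side is the value left over from the previous iteration, and the assignment overwrites it. With v_dW initialized to 0, the first term is zero only on the very first iteration; after that it carries the running average of past gradients.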
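
To make the layer index versus iteration index distinction concrete, here is a minimal sketch in Python. It is not the assignment's actual code: the function name `momentum_step`, the dictionary keys `"W1"` and `"b1"`, and the toy gradient values are all assumptions chosen for illustration. The loop over `l` walks the layers, while the iteration counter `t` never appears inside the update.

```python
import numpy as np

def momentum_step(params, grads, v, beta=0.9, learning_rate=0.01):
    """One iteration of gradient descent with momentum (hypothetical helper).

    Assumes params, grads and v are dicts keyed "W1", "b1", ..., "WL", "bL".
    """
    num_layers = len(params) // 2
    for l in range(1, num_layers + 1):      # l is the layer number
        for name in ("W", "b"):
            key = f"{name}{l}"
            # The v[key] on the right is the value from the previous
            # iteration; the assignment overwrites it in place.
            v[key] = beta * v[key] + (1 - beta) * grads[key]
            params[key] = params[key] - learning_rate * v[key]
    return params, v

# Toy one-layer example with a constant gradient of 1 (illustrative values).
params = {"W1": np.array([1.0, 2.0]), "b1": np.array([0.0])}
grads = {"W1": np.array([1.0, 1.0]), "b1": np.array([1.0])}
v = {key: np.zeros_like(p) for key, p in params.items()}  # v starts at 0

history = []
for t in range(1, 4):                        # t is the iteration number
    params, v = momentum_step(params, grads, v, beta=0.9)
    history.append(float(v["W1"][0]))

print(history)  # approximately [0.1, 0.19, 0.271]
```

On the first iteration the β term contributes nothing because v is still 0, so v is just (1 − β) × dW. From the second iteration onward it carries the accumulated average forward, which is why the zero initialization is not a problem.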
