Week 2 Question - Intent of removing (1 - \beta) from the gradient descent with momentum equation

In the Gradient Descent with Momentum video (Week 2), I didn’t fully understand the rationale for removing the (1−β) term when multiplying by dW. I see how this changes the effective learning rate and scale, but does it matter in practice which formulation we use? Is this just a matter of preference?
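For reference, here are the two formulations from the video as I understand them (my own transcription, so please correct me if I have them mixed up):

With the factor: v_{dW} = \beta v_{dW} + (1 - \beta) dW, followed by W := W - \alpha v_{dW}
Without the factor: v_{dW} = \beta v_{dW} + dW, followed by W := W - \alpha v_{dW}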

Thanks!

I just went back and watched that lecture again to refresh my memory. What I understood him to say is that whether you include the factor of (1 - \beta) on the dW term is just a matter of preference. But he explains that omitting that factor makes things more complicated if you later need to tune the hyperparameter \beta: without the (1 - \beta) factor, the accumulated v_{dW} is effectively scaled up by 1/(1 - \beta), so changing \beta changes the scale of the update and forces you to retune \alpha as well. He says explicitly (starting at offset 8:42) that he personally prefers the first method, which just implements the natural version of exponentially weighted averages.
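To make the scaling point concrete, here is a minimal sketch (not from the course notebooks; the 1-D quadratic and the variable names are just illustrative) showing that the two formulations trace exactly the same path once \alpha is rescaled by (1 - \beta):

```python
# Illustrative 1-D example: minimize f(w) = w^2, so dW = 2w.
def grad(w):
    return 2.0 * w

beta = 0.9
alpha = 0.1

# Version 1: v = beta*v + (1 - beta)*dW  (the natural exponentially weighted average)
w1, v1 = 5.0, 0.0
# Version 2: v = beta*v + dW, with alpha rescaled by (1 - beta) to compensate
w2, v2 = 5.0, 0.0
alpha2 = alpha * (1.0 - beta)

for _ in range(50):
    dW = grad(w1)
    v1 = beta * v1 + (1.0 - beta) * dW
    w1 = w1 - alpha * v1

    dW = grad(w2)
    v2 = beta * v2 + dW          # v2 is exactly 1/(1 - beta) times v1 at every step
    w2 = w2 - alpha2 * v2

print(w1, w2)  # identical values: the two forms differ only by how alpha is scaled
```

So keeping the (1 - \beta) factor lets you tune \alpha and \beta more independently, which, as I heard it, is the reason he prefers that version.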
