DLS 2 Week 2: Gradient Descent with Momentum "simplification"

First, I recognize that there are other related posts, but they don’t seem to answer this question or ponder this alternative to adjusting \alpha. If they do, I apologize…

In this video, at 7:33 Andrew says this:

“Finally, I just want to mention that if you read the literature on gradient descent with momentum often you see it with this term omitted, with this 1 minus Beta term omitted. So you end up with vdW equals Beta vdw plus dW. And the net effect of using this version in purple is that vdW ends up being scaled by a factor of 1 minus Beta, or really 1 over 1 minus Beta. And so when you’re performing these gradient descent updates, alpha just needs to change by a corresponding value of 1 over 1 minus Beta. In practice, both of these will work just fine, it just affects what’s the best value of the learning rate alpha. But I find that this particular formulation is a little less intuitive. Because one impact of this is that if you end up tuning the hyperparameter Beta, then this affects the scaling of vdW and vdb as well. And so you end up needing to retune the learning rate, alpha, as well, maybe. So I personally prefer the formulation that I have written here on the left, rather than leaving out the 1 minus Beta term. But, so I tend to use the formula on the left, the printed formula with the 1 minus Beta term. But both versions having Beta equal 0.9 is a common choice of hyperparameter. It’s just at alpha the learning rate would need to be tuned differently for these two different versions.”
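For reference, writing out the two versions Andrew is comparing (the printed formula on the left, and the simplified one in purple with the (1-\beta) term omitted):

v_{dW} \leftarrow \beta\, v_{dW} + (1-\beta)\, dW, \qquad W \leftarrow W - \alpha\, v_{dW}

versus

v_{dW} \leftarrow \beta\, v_{dW} + dW, \qquad W \leftarrow W - \alpha\, v_{dW}.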

This doesn’t quite make sense to me. Dropping the (1-\beta) term makes the velocity (and hence the effective gradient step) larger, so to compensate you should want to reduce the learning rate, no? So instead of setting

\alpha \leftarrow {\alpha \over (1-\beta)},

which makes \alpha bigger, don’t you want

\alpha \leftarrow \alpha \beta?

Also, I have not found any papers that use this simplification; if one exists, it would probably settle the question.

Here’s a little flash card I wrote (one of many cards I might donate to this group…very helpful) summarizing what I think is correct. Could a mentor weigh in on whether it seems right, or, if wrong, why?

Hi @Greening,

Thank you for taking the time to write this post!

I do think the learning rate should be smaller if you remove the (1 - \beta) term. For what it’s worth, you can play with the update_parameters_with_momentum function in the Week 2 Programming Assignment: if you remove the (1 - \beta) term and scale \alpha by (1 - \beta), you get the same results.
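If it helps, here is a minimal sketch of that experiment in plain NumPy (my own toy loop, not the assignment’s actual update_parameters_with_momentum code): run both momentum formulations on the same gradient stream, rescaling \alpha by (1 - \beta) for the undamped version, and check that the parameter trajectories coincide.

```python
# Toy check that the two momentum formulations match when the
# learning rate of the undamped version is rescaled by (1 - beta).
import numpy as np

def momentum_step(w, v, grad, alpha, beta, damped=True):
    """One momentum update.

    damped=True  : v = beta * v + (1 - beta) * grad   (Andrew's printed form)
    damped=False : v = beta * v + grad                (the "simplified" form)
    """
    v = beta * v + ((1 - beta) * grad if damped else grad)
    w = w - alpha * v
    return w, v

rng = np.random.default_rng(0)
beta, alpha = 0.9, 0.1

w1, v1 = np.ones(3), np.zeros(3)   # damped formulation
w2, v2 = np.ones(3), np.zeros(3)   # simplified formulation, alpha * (1 - beta)

for _ in range(50):
    grad = rng.normal(size=3)      # same gradient fed to both versions
    w1, v1 = momentum_step(w1, v1, grad, alpha, beta, damped=True)
    w2, v2 = momentum_step(w2, v2, grad, alpha * (1 - beta), beta, damped=False)

print(np.allclose(w1, w2))  # → True: the two trajectories coincide
```

The two runs agree because, with zero initialization, the undamped velocity is exactly the damped one divided by (1 - \beta), so scaling \alpha by (1 - \beta) cancels the difference.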

I haven’t seen it referred to as a simplification, but here’s an example of actual usage.

Hope you’re enjoying the course :slight_smile:


Your reasoning is very thoughtful, and you are right about the direction. When Andrew says \alpha "needs to change by a corresponding value of 1 / (1 - \beta)", he means that v_{dW} ends up scaled *up* by 1 / (1 - \beta) in the simplified equation: removing the (1-\beta) term removes its damping effect, so the velocity update is larger. To keep the overall weight updates comparable to the original form, the learning rate must therefore be scaled *down*, i.e. \alpha \leftarrow \alpha (1 - \beta). Your proposed \alpha \leftarrow \alpha \beta shrinks \alpha, which is the right direction, but by the wrong factor: with \beta = 0.9 it only reduces \alpha by 10%, while the velocity has grown by roughly a factor of 10. The compensating factor is (1 - \beta) = 0.1, not \beta.
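To see the exact factor, write \tilde{v}_{dW} for the velocity of the simplified version (my notation). If both velocities start at zero and see the same gradients, then by induction on t, with \tilde{v}_{dW}^{(t-1)} = v_{dW}^{(t-1)} / (1-\beta),

\tilde{v}_{dW}^{(t)} = \beta\, \tilde{v}_{dW}^{(t-1)} + dW^{(t)} = {1 \over 1-\beta}\left(\beta\, v_{dW}^{(t-1)} + (1-\beta)\, dW^{(t)}\right) = {v_{dW}^{(t)} \over 1-\beta},

so the weight update \alpha (1-\beta)\, \tilde{v}_{dW}^{(t)} = \alpha\, v_{dW}^{(t)} reproduces the original update exactly.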

Also, in practice, if you remove the (1 - \beta) term and experiment with scaling \alpha by (1 - \beta) in the code, you will get the same result, as @nramon mentioned.

Hope this helps!