First, I recognize that there are other related posts, but they don’t seem to answer this question or consider this alternative way of adjusting \alpha. If they do, I apologize…

In this video, at 7:33 Andrew says this:

“Finally, I just want to mention that if you read the literature on gradient descent with momentum often you see it with this term omitted, with this 1 minus Beta term omitted. So you end up with vdW equals Beta vdw plus dW. And the net effect of using this version in purple is that vdW ends up being scaled by a factor of 1 minus Beta, or really 1 over 1 minus Beta. And so when you’re performing these gradient descent updates, alpha just needs to change by a corresponding value of 1 over 1 minus Beta. In practice, both of these will work just fine, it just affects what’s the best value of the learning rate alpha. But I find that this particular formulation is a little less intuitive. Because one impact of this is that if you end up tuning the hyperparameter Beta, then this affects the scaling of vdW and vdb as well. And so you end up needing to retune the learning rate, alpha, as well, maybe. So I personally prefer the formulation that I have written here on the left, rather than leaving out the 1 minus Beta term. But, so I tend to use the formula on the left, the printed formula with the 1 minus Beta term. But both versions having Beta equal 0.9 is a common choice of hyperparameter. It’s just at alpha the learning rate would need to be tuned differently for these two different versions.”

This doesn’t quite make sense to me. Omitting the 1 - \beta term makes vdW larger, so to compensate you would want to reduce the learning rate, no? So instead of setting

\alpha \leftarrow {\alpha \over (1-\beta)},

which makes \alpha bigger, don’t you want

\alpha \leftarrow \alpha \beta?
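
A quick numerical check might help frame the question. This is only a minimal sketch under my own assumptions, none of which come from the lecture: a toy scalar loss 0.5 * (w - 3)^2, weight and velocity both starting at zero, and a hypothetical helper named `run_momentum`. It runs the version with the 1 - \beta term at learning rate \alpha, then the version without it at learning rate \alpha (1 - \beta), and compares the results.

```python
def run_momentum(alpha, beta, use_one_minus_beta, steps=50):
    """Gradient descent with momentum on the toy loss 0.5 * (w - 3)**2.

    Returns the final weight and the final velocity.
    """
    w, v = 0.0, 0.0  # assumed: weight and velocity both start at zero
    for _ in range(steps):
        dW = w - 3.0  # gradient of the toy loss at the current w
        if use_one_minus_beta:
            v = beta * v + (1.0 - beta) * dW  # version with the 1 - beta term
        else:
            v = beta * v + dW  # version with the 1 - beta term omitted
        w -= alpha * v
    return w, v


alpha, beta = 0.1, 0.9

# Version with the 1 - beta term, learning rate alpha.
w_left, v_left = run_momentum(alpha, beta, use_one_minus_beta=True)

# Version without the term, learning rate scaled DOWN by (1 - beta).
w_purple, v_purple = run_momentum(alpha * (1.0 - beta), beta, use_one_minus_beta=False)

print(w_left, w_purple)                 # final weights of the two runs
print(v_left, v_purple * (1.0 - beta))  # velocities after rescaling
```

If the sketch is right, the two runs follow the same trajectory: the velocity without the 1 - \beta term is larger by a factor of 1 / (1 - \beta), so the matching learning rate would be \alpha (1 - \beta), which is neither of the two expressions above.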

Also, I have not found any papers that use this simplified form, without the 1 - \beta term. I suppose I would probably find the answer there.

Here’s a little flash card I wrote (one of many cards I might donate to this group…very helpful) stating what I think is correct. Could a mentor give an opinion on whether it seems right or, if it is wrong, explain why?