Diagram for standard “gradient descent”
Note that one can tune the learning rate dynamically, perhaps even automatically depending on how good the progress currently is (maybe good old control theory has something to say about that … but probably not, as we don’t even know what the W landscape actually looks like).
(original diagram removed to be replaced by the one at the end of this thread)
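For reference, a minimal Python sketch of the plain update; the decaying schedule `alpha_0 / (1 + decay_rate * t)` is just one illustrative way of “tuning the learning rate dynamically”, not something taken from the diagram:

```python
import numpy as np

def gradient_descent_step(W, dW, learning_rate):
    """Plain gradient descent: W <- W - alpha * dW."""
    return W - learning_rate * dW

# Toy run on the loss ||W||^2 (gradient 2W), with a hand-rolled decay
# schedule standing in for "dynamic tuning" of the learning rate.
alpha_0, decay_rate = 0.1, 0.01
W = np.array([1.0, -2.0])
for t in range(100):
    dW = 2 * W
    W = gradient_descent_step(W, dW, alpha_0 / (1 + decay_rate * t))
print(W)
```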
Diagram for standard “gradient descent with momentum”
Whereby we tame high-frequency changes in the gradients by computing a running average of the gradient before updating the weights (interestingly, as it is an average, it always “runs a few steps behind” the current gradient, but this may not be a problem).
(original diagram, which was wrong, removed to be replaced by the one at the end of this thread)
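A minimal sketch of that update (beta = 0.9 and the toy loss are my own illustrative choices, not taken from the diagram):

```python
import numpy as np

def momentum_step(W, dW, v_dW, learning_rate=0.01, beta=0.9):
    """Gradient descent with momentum.

    v_dW is an exponential running average of the gradients; because it is
    an average of past values it lags a little behind the current dW, which
    is exactly the smoothing effect described above.
    """
    v_dW = beta * v_dW + (1 - beta) * dW   # running average of the gradient
    W = W - learning_rate * v_dW           # the update uses v_dW, not dW
    return W, v_dW

# Usage: initialize v_dW to zeros of the same shape as W.
W, v_dW = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    dW = 2 * W                             # gradient of the toy loss ||W||^2
    W, v_dW = momentum_step(W, dW, v_dW)
```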
Diagram for standard “gradient descent with ‘root mean square propagation’”
Whereby we tame the spread of the gradient’s components by dividing each component individually by the square root of a running average of that component’s square.
Question: Could we just keep a running average of dW’s absolute magnitude rather than of its squared Frobenius norm, which then needs to be square-rooted again? It seems more efficient.
(original diagram, which had problems, removed to be replaced by the one at the end of this thread)
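And the corresponding RMSprop sketch (again, hyperparameter values are just common defaults, not from the diagram):

```python
import numpy as np

def rmsprop_step(W, dW, s_dW, learning_rate=0.01, beta=0.999, epsilon=1e-8):
    """RMSprop: each gradient component is divided by the square root of a
    running average of that component's square (everything is elementwise)."""
    s_dW = beta * s_dW + (1 - beta) * dW ** 2          # elementwise square
    W = W - learning_rate * dW / (np.sqrt(s_dW) + epsilon)
    return W, s_dW

# Usage: initialize s_dW to zeros of the same shape as W.
W, s_dW = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    dW = 2 * W
    W, s_dW = rmsprop_step(W, dW, s_dW)
```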
Another question: In RMSprop gradient descent, the individual components of the gradient are divided by an individual running average of their respective magnitude. Doesn’t this “destroy” (decohere?) the gradient by scaling every component individually?
I asked ChatGPT and it says:
Why This Doesn’t Destroy the Gradient
- Normalization Helps with Conditioning: If different parameters have very different magnitudes, normalizing by the running average of past squared gradients helps keep updates consistent across all dimensions.
- Gradient Direction is Maintained: Although each component is scaled differently, the direction is mostly preserved because RMSprop does not introduce arbitrary distortions—only a per-dimension adaptive learning rate.
- Prevents Vanishing/Exploding Steps: Without this adaptive scaling, certain parameters could receive steps that are too large or too small, leading to slow convergence or divergence.
When Might It Be a Problem?
If the running average becomes too small for some components, those directions could receive overly large updates. However, the ϵ term prevents division by excessively small values, mitigating this issue.
Hello, David,
In the diagram for standard “gradient descent with momentum”, the symbol v_{dW} is not seen in the update formula. Need some change there?
I think we could, but then the behavior should change. The squared gradient is more often called, in the literature, the “second-order moment of the gradient”. There is research studying the behavior of Adam with higher-order moments instead of the second-order one, so this is not just about taking the sign away.
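(As a rough illustration of what such a higher-order-moment variant could look like: track a running average of |dW|^p and divide by its p-th root. p = 2 is the usual second moment, p = 1 is the “just keep the absolute value” idea above, and large p heads towards the infinity-norm variant used in AdaMax. The function below is only a sketch with made-up defaults.)

```python
import numpy as np

def pth_moment_step(W, dW, s_dW, p=2, learning_rate=0.01,
                    beta=0.999, epsilon=1e-8):
    """RMSprop-style update using a running average of |dW|^p instead of
    dW^2, then dividing by its p-th root. p=2 recovers the usual version;
    p=1 keeps only absolute values (no squaring / square-rooting needed)."""
    s_dW = beta * s_dW + (1 - beta) * np.abs(dW) ** p
    W = W - learning_rate * dW / (s_dW ** (1.0 / p) + epsilon)
    return W, s_dW
```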
I think, by “destroy”, you meant the direction of the gradient is changed (correct me if not). However, if we look at the slide below, bending the blue path to the red path is the whole point, so changing the direction is ok.
Cheers,
Raymond
Thanks Raymond.
1. Yes, I will look into this diagram error. It occurred to me there was another error, but I can’t remember yet what it was.
2. Regarding “I think we could, but then the behavior should change”:
Ah yes, because we are computing the running average of the Frobenius norm, which is then square-rooted.
(update: wrong, we are just computing squares of the individual elements, of which we then take an exponential running average; no norm is computed)
Yes, using abs() will change the trajectory of W, but will it change it for the worse? It is not evident why dividing by \sqrt{\text{runningavg}(\|dW\|_2^2)} should be any less suspect than dividing by \text{runningavg}(\|dW\|_1). Empirical studies may be needed (a toy comparison of the two accumulators is sketched after point 3 below). I posed the question to ChatGPT (“In neural networks, regarding gradient descent with root mean square propagation, has anyone done research into what happens if one computes the running average with the 1-norm instead of the Frobenius norm?”); it gave some general pointers but could not find anything directly relevant.
3. Correct, I was wondering whether scaling the various dimensions of the gradient independently would mess up the gradient too much. Taking the gradient vector and scaling all its components independently (by dividing by the square root of the running average of the Frobenius norm) yields a vector whose components are pushed towards magnitude 1 (from above or below); it points in a new direction, but remains in the same “quadrant” of the high-dimensional space (i.e. all the signs of the components are retained).
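A tiny numerical check of that sign-preservation claim (values made up):

```python
import numpy as np

dW = np.array([0.5, -3.0, 0.002, -40.0])      # made-up gradient components
s_dW = np.array([0.3, 8.0, 1e-5, 1500.0])     # made-up running avgs of squares
scaled = dW / (np.sqrt(s_dW) + 1e-8)

print(np.sign(scaled) == np.sign(dW))   # all True: same "quadrant"
print(scaled)                           # magnitudes all pulled towards ~1
```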
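And here is the toy comparison promised under point 2, feeding the same noisy scalar “gradient” to both accumulators (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9
s_sq, s_abs = 0.0, 0.0

for _ in range(1000):
    g = rng.normal(loc=0.0, scale=2.0)             # noisy scalar "gradient"
    s_sq = beta * s_sq + (1 - beta) * g ** 2       # square-then-sqrt version
    s_abs = beta * s_abs + (1 - beta) * abs(g)     # plain |g| version

print("sqrt(running avg of g^2):", np.sqrt(s_sq))
print("running avg of |g|     :", s_abs)
# For this zero-mean Gaussian stream the |g| accumulator comes out smaller
# by roughly the constant factor sqrt(2/pi), so the main effect here is an
# effective rescaling of the learning rate.
```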
Was that error the extra subscript 2 to \beta in the third diagram?
Yes and no. Turns out I’m completely wrong.
First, there was a major copy-paste incident in the second diagram, and then I was talking about the Frobenius norm, but we don’t use any norm at all: we just use elementwise squaring and elementwise square-rooting.
This doesn’t make the previous discussion completely wrong, however.
Anyway, update:
Final update, this time with “adam”.
These diagrams differ a bit from the course in that a new variable is introduced as the output of the “componentwise rescaling”. It is called either v_{reg} or dw_{reg}. This makes the diagrams more structured.
There is no v_{dW}^{corr} or s_{dW}^{corr} either, as those are just written back into v_{dW} or s_{dW}.
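For completeness, a Python sketch of the same dataflow; note that I keep the bias-corrected values in local names here so the state carried between steps stays the plain running averages, whereas the diagrams simply write them back into v_{dW} and s_{dW}:

```python
import numpy as np

def adam_step(W, dW, v_dW, s_dW, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Adam, with the output of the componentwise rescaling given its own
    name (dW_reg) as in the diagrams."""
    v_dW = beta1 * v_dW + (1 - beta1) * dW           # first moment of dW
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2      # second moment (elementwise)
    v_corr = v_dW / (1 - beta1 ** t)                 # bias correction
    s_corr = s_dW / (1 - beta2 ** t)
    dW_reg = v_corr / (np.sqrt(s_corr) + epsilon)    # componentwise rescaling
    W = W - learning_rate * dW_reg
    return W, v_dW, s_dW

# Usage: t starts at 1 so the bias-correction denominators are nonzero.
W, v_dW, s_dW = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 101):
    dW = 2 * W
    W, v_dW, s_dW = adam_step(W, dW, v_dW, s_dW, t)
```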
Gradient Descent
Gradient Descent with momentum
Gradient Descent with RMSprop
Gradient Descent with adam / adaptive moment estimation
Finally, the diagrams are here:
Hello David, I am not sure about the following description in Adam:

The divisor is the second-order moment, so it does not seem to be “proportional to … the component’s magnitude”. Also, I am not sure what is being moved closer to 1.
Cheers,
Raymond

The statement about acceleration and velocity was a bit odd to me because both v_dw and dW carry the same unit. Then I did the following:
which shows just how the difference gets damped by (1-\beta). Though it’s imaginable why you would consider dW as the acceleration, I would be more comfortable with dW - v_{dW}. Just sharing thoughts.
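(The rearrangement referred to is presumably along these lines, reconstructed here from the momentum update since the screenshot is not reproduced:)

$$
v_{dW}^{(t)} = \beta\, v_{dW}^{(t-1)} + (1-\beta)\,\mathrm{d}W
\quad\Longrightarrow\quad
v_{dW}^{(t)} - v_{dW}^{(t-1)} = (1-\beta)\left(\mathrm{d}W - v_{dW}^{(t-1)}\right)
$$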
Okay, let me think about that 
Here is the idea:
Take any component of \mathrm{d}W^{l}, let’s say \mathrm{d}w_{i,j}
Then that component will be divided by the square root of a moving average of the squares of that component.
This should be “near” |\mathrm{d}w_{i,j}|
So one would expect both a large and a small component \mathrm{d}w_{i,j} of the gradient to be “near 1” in magnitude (but not sign) after the division by \sqrt{s_{\mathrm{d}W^{l}}}.
I will try to build a plot.
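In the meantime, here is a quick numerical check of that intuition (a single component with small relative fluctuations; all values made up):

```python
import numpy as np

# If a component dw_ij only fluctuates a little around its typical size,
# sqrt(s) tracks |dw_ij| and the rescaled component ends up near +/-1,
# regardless of how big or small the component is.
beta, eps = 0.9, 1e-8
rng = np.random.default_rng(1)

for scale in [1e-4, 1.0, 1e3]:                    # tiny, medium, huge component
    s = 0.0
    for t in range(2000):
        dw = scale * (1.0 + 0.1 * rng.normal())   # small relative fluctuations
        s = beta * s + (1 - beta) * dw ** 2
    print(scale, dw / (np.sqrt(s) + eps))          # close to 1 in each case
```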
This ratio is 1 if the two betas are both equal to 0, which is not a meaningful case.
If dW is very large such that r_1 and r_2 are both approximately zero, then, from (3), we have …, ignoring C_t and \epsilon.
If dW is very close to 0, then, from (1), it’s still not trivial that the ratio becomes 1.
What do you think about writing a program, assuming some dW trajectories, such as an exponentially decaying trajectory, and seeing how the ratio evolves as dW tends to zero? I mean, in a normal situation, the gradient (both before and after applying Adam) should converge to zero as it reaches an optimal point.
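A minimal version of that experiment could look like the sketch below (scalar gradient, arbitrary decay rate, the usual default betas):

```python
import numpy as np

# Feed Adam an exponentially decaying scalar gradient and watch the
# rescaled component v_corr / (sqrt(s_corr) + eps) as dW tends to zero.
beta1, beta2, eps = 0.9, 0.999, 1e-8
v, s = 0.0, 0.0

for t in range(1, 5001):
    dW = 0.99 ** t                        # exponentially decaying gradient
    v = beta1 * v + (1 - beta1) * dW
    s = beta2 * s + (1 - beta2) * dW ** 2
    v_corr = v / (1 - beta1 ** t)
    s_corr = s / (1 - beta2 ** t)
    if t % 1000 == 0:
        print(t, dW, v_corr / (np.sqrt(s_corr) + eps))
```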
Cheers,
Raymond
I’m getting lost in my code currently, so first, here are the latest versions of the dataflow diagrams, with text that may or may not be 100% correct (I’m being pretty unapologetic about that; those algorithms are happy-go-lucky and the vocabulary is all over the place … luckily they work, if the weight “landscape” is successfully accommodating, that is).
Btw. for anyone interested in meta-heuristic algorithms (yet another one of those words, it means “rule of thumb to find a rule-of-thumb”), there is this:
Here we go:
Vanilla Gradient Descent
Gradient Descent “with momentum”
Gradient Descent “with root mean square propagation” (rmsprop)
Gradient Descent “with adaptive moments” (adam)
My own modified Adam
It feels more logical, but I have no idea about the performance (probably not great for reasons)
And also
It might be interesting to see what happens if the exponentially decaying moving average is replaced by a proper (uniformly weighted) average. This would be too expensive for billions of parameters, but we can always test it in our toy exercises.
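A toy version of that comparison might look like this (made-up, shrinking gradient stream; a uniformly weighted mean can be kept incrementally with just a counter, so at least in this scalar case the interesting difference is that it never forgets old gradients):

```python
import numpy as np

beta = 0.999
ema, mean_sq, count = 0.0, 0.0, 0
rng = np.random.default_rng(2)

for t in range(1, 10001):
    g = rng.normal(scale=1.0 / t ** 0.5)       # made-up, shrinking gradients
    ema = beta * ema + (1 - beta) * g ** 2     # exponentially decaying average
    count += 1
    mean_sq += (g ** 2 - mean_sq) / count      # exact uniformly weighted mean
    if t % 2500 == 0:
        print(t, ema / (1 - beta ** t), mean_sq)
```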