Diagram for standard “gradient descent”
Note that one can tune the learning rate dynamically, perhaps even automatically depending on how good the progress currently is (maybe good old control theory has something to say about that … but probably not, as we don’t even know what the W landscape actually looks like).
(original diagram removed to be replaced by the one at the end of this thread)
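For reference, a minimal Python sketch of the plain update; the decaying schedule `alpha_0 / (1 + decay_rate * t)` is just one illustrative way of “tuning the learning rate dynamically”, not something taken from the diagram:

```python
import numpy as np

def gradient_descent_step(W, dW, learning_rate):
    """Plain gradient descent: W <- W - alpha * dW."""
    return W - learning_rate * dW

# Toy run on the loss ||W||^2 (gradient 2W), with a hand-rolled decay
# schedule standing in for "dynamic tuning" of the learning rate.
alpha_0, decay_rate = 0.1, 0.01
W = np.array([1.0, -2.0])
for t in range(100):
    dW = 2 * W
    W = gradient_descent_step(W, dW, alpha_0 / (1 + decay_rate * t))
print(W)
```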
Diagram for standard “gradient descent with momentum”
Whereby we tame high-frequency changes in the gradients by computing a running average of the gradient before updating the weights (interestingly, as it is an average, it always “runs a few steps behind” the current gradient, but this may not be a problem).
(original diagram, which was wrong, removed to be replaced by the one at the end of this thread)
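A minimal sketch of that update (beta = 0.9 and the toy loss are my own illustrative choices, not taken from the diagram):

```python
import numpy as np

def momentum_step(W, dW, v_dW, learning_rate=0.01, beta=0.9):
    """Gradient descent with momentum.

    v_dW is an exponential running average of the gradients; because it is
    an average of past values it lags a little behind the current dW, which
    is exactly the smoothing effect described above.
    """
    v_dW = beta * v_dW + (1 - beta) * dW   # running average of the gradient
    W = W - learning_rate * v_dW           # the update uses v_dW, not dW
    return W, v_dW

# Usage: initialize v_dW to zeros of the same shape as W.
W, v_dW = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    dW = 2 * W                             # gradient of the toy loss ||W||^2
    W, v_dW = momentum_step(W, dW, v_dW)
```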
Diagram for standard “gradient descent with ‘root mean square propagation’”
Whereby we tame the spread of the gradient’s components by dividing each component individually by the square root of a running average of that component’s square.
Question: Could we just keep a running average of dW’s absolute magnitude rather than of its squared Frobenius norm, which then needs to be square-rooted again? It seems more efficient.
(original diagram, which had problems, removed to be replaced by the one at the end of this thread)
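And the corresponding RMSprop sketch (again, hyperparameter values are just common defaults, not from the diagram):

```python
import numpy as np

def rmsprop_step(W, dW, s_dW, learning_rate=0.01, beta=0.999, epsilon=1e-8):
    """RMSprop: each gradient component is divided by the square root of a
    running average of that component's square (everything is elementwise)."""
    s_dW = beta * s_dW + (1 - beta) * dW ** 2          # elementwise square
    W = W - learning_rate * dW / (np.sqrt(s_dW) + epsilon)
    return W, s_dW

# Usage: initialize s_dW to zeros of the same shape as W.
W, s_dW = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    dW = 2 * W
    W, s_dW = rmsprop_step(W, dW, s_dW)
```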
Another question: In RMSprop gradient descent, the individual components of the gradient are divided by an individual running average of their respective magnitude. Doesn’t this “destroy” (decohere?) the gradient by scaling every component individually?
I asked ChatGPT and it says:
Why This Doesn’t Destroy the Gradient
- Normalization Helps with Conditioning: If different parameters have very different magnitudes, normalizing by the running average of past squared gradients helps keep updates consistent across all dimensions.
- Gradient Direction is Maintained: Although each component is scaled differently, the direction is mostly preserved because RMSprop does not introduce arbitrary distortions—only a per-dimension adaptive learning rate.
- Prevents Vanishing/Exploding Steps: Without this adaptive scaling, certain parameters could receive steps that are too large or too small, leading to slow convergence or divergence.
When Might It Be a Problem?
If the running average becomes too small for some components, those directions could receive overly large updates. However, the ϵ term prevents division by excessively small values, mitigating this issue.
Hello, David,
In the diagram for standard “gradient descent with momentum”, the symbol v_{dW} is not seen in the update formula. Need some change there?
I think we could, but then the behavior should change. The squared gradient is more often called, in the literature, the “second-order moment of the gradient”. There is research studying the behavior of Adam with higher-order moments instead of the second-order one, so this is not just about taking the sign away.
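(As a rough illustration of what such a higher-order-moment variant could look like: track a running average of |dW|^p and divide by its p-th root. p = 2 is the usual second moment, p = 1 is the “just keep the absolute value” idea above, and large p heads towards the infinity-norm variant used in AdaMax. The function below is only a sketch with made-up defaults.)

```python
import numpy as np

def pth_moment_step(W, dW, s_dW, p=2, learning_rate=0.01,
                    beta=0.999, epsilon=1e-8):
    """RMSprop-style update using a running average of |dW|^p instead of
    dW^2, then dividing by its p-th root. p=2 recovers the usual version;
    p=1 keeps only absolute values (no squaring / square-rooting needed)."""
    s_dW = beta * s_dW + (1 - beta) * np.abs(dW) ** p
    W = W - learning_rate * dW / (s_dW ** (1.0 / p) + epsilon)
    return W, s_dW
```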
I think, by “destroy”, you meant the direction of the gradient is changed (correct me if not). However, if we look at the slide below, bending the blue path to the red path is the whole point, so changing the direction is ok.
Cheers,
Raymond
Thanks Raymond.
1. Yes, I will look into this diagram error. It occurred to me there was another error, but I can’t remember yet what it was.
2. Regarding “I think we could, but then the behavior should change”:
Ah yes, because we are computing the running average of the Frobenius norm, which is then square-rooted.
(update: wrong, we are just computing squares of the individual elements, of which we then take an exponential running average; no norm is computed)
Yes, using abs() will change the trajectory of W, but will it change it for the worse? It is not evident why dividing by \sqrt{\text{runningavg}(\|dW\|_2^2)} should be any less suspect than dividing by \text{runningavg}(\|dW\|_1). Empirical studies may be needed (a toy comparison of the two accumulators is sketched after point 3 below). I posed the question to ChatGPT (“In neural networks, regarding gradient descent with root mean square propagation, has anyone done research into what happens if one computes the running average with the 1-norm instead of the Frobenius norm?”); it gave some general pointers but could not find anything directly relevant.
3. Correct, I was wondering whether scaling the various dimensions of the gradient independently would mess up the gradient too much. Taking the gradient vector and scaling all its components independently (by dividing by the square root of the running average of the Frobenius norm) yields a vector whose components are pushed towards magnitude 1 (from above or below); it points in a new direction, but remains in the same “quadrant” of the high-dimensional space (i.e. all the signs of the components are retained).
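A tiny numerical check of that sign-preservation claim (values made up):

```python
import numpy as np

dW = np.array([0.5, -3.0, 0.002, -40.0])      # made-up gradient components
s_dW = np.array([0.3, 8.0, 1e-5, 1500.0])     # made-up running avgs of squares
scaled = dW / (np.sqrt(s_dW) + 1e-8)

print(np.sign(scaled) == np.sign(dW))   # all True: same "quadrant"
print(scaled)                           # magnitudes all pulled towards ~1
```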
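And here is the toy comparison promised under point 2, feeding the same noisy scalar “gradient” to both accumulators (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9
s_sq, s_abs = 0.0, 0.0

for _ in range(1000):
    g = rng.normal(loc=0.0, scale=2.0)             # noisy scalar "gradient"
    s_sq = beta * s_sq + (1 - beta) * g ** 2       # square-then-sqrt version
    s_abs = beta * s_abs + (1 - beta) * abs(g)     # plain |g| version

print("sqrt(running avg of g^2):", np.sqrt(s_sq))
print("running avg of |g|     :", s_abs)
# For this zero-mean Gaussian stream the |g| accumulator comes out smaller
# by roughly the constant factor sqrt(2/pi), so the main effect here is an
# effective rescaling of the learning rate.
```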
Was that error the extra subscript 2 to \beta in the third diagram?
Yes and no. Turns out I’m completely wrong.
First, there was a major copy-paste incident in the second diagram, and then I was talking about the Frobenius norm, but we don’t use any norm at all: we just use elementwise squaring and elementwise square-rooting.
This doesn’t make the previous discussion completely wrong, however.
Anyway, update:
Final update, this time with “adam”.
These diagrams differ a bit from the course in that a new variable is introduced as the output of the “componentwise rescaling”. It is called either v_{reg} or dw_{reg}. This makes the diagrams more structured.
There is no v_{dW}^{corr} or s_{dW}^{corr} either, as those are just written back into v_{dW} or s_{dW}.
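For completeness, a Python sketch of the same dataflow; note that I keep the bias-corrected values in local names here so the state carried between steps stays the plain running averages, whereas the diagrams simply write them back into v_{dW} and s_{dW}:

```python
import numpy as np

def adam_step(W, dW, v_dW, s_dW, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Adam, with the output of the componentwise rescaling given its own
    name (dW_reg) as in the diagrams."""
    v_dW = beta1 * v_dW + (1 - beta1) * dW           # first moment of dW
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2      # second moment (elementwise)
    v_corr = v_dW / (1 - beta1 ** t)                 # bias correction
    s_corr = s_dW / (1 - beta2 ** t)
    dW_reg = v_corr / (np.sqrt(s_corr) + epsilon)    # componentwise rescaling
    W = W - learning_rate * dW_reg
    return W, v_dW, s_dW

# Usage: t starts at 1 so the bias-correction denominators are nonzero.
W, v_dW, s_dW = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 101):
    dW = 2 * W
    W, v_dW, s_dW = adam_step(W, dW, v_dW, s_dW, t)
```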
Gradient Descent
Gradient Descent with momentum
Gradient Descent with RMSprop
Gradient Descent with adam / adaptive moment estimation
Finally, the diagrams are here:
Hello David, I am not sure about the following description in Adam:

The divisor is the second-order moment, so it does not seem to be “proportional to … the component’s magnitude”. Also, I am not sure what is being moved closer to 1.
Cheers,
Raymond

The statement about acceleration and velocity was a bit odd to me because both v_dw and dW carry the same unit. Then I did the following:
which shows just how the difference gets damped by (1-\beta). Though it’s imaginable why you would consider dW as the acceleration, I would be more comfortable with dW - v_{dW}. Just sharing thoughts.
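(The rearrangement referred to is presumably along these lines, reconstructed here from the momentum update since the screenshot is not reproduced:)

$$
v_{dW}^{(t)} = \beta\, v_{dW}^{(t-1)} + (1-\beta)\,\mathrm{d}W
\quad\Longrightarrow\quad
v_{dW}^{(t)} - v_{dW}^{(t-1)} = (1-\beta)\left(\mathrm{d}W - v_{dW}^{(t-1)}\right)
$$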
Okay, let me think about that 
Here is the idea:
Take any component of \mathrm{d}W^{l}, let’s say \mathrm{d}w_{i,j}
Then that component will be divided by the square root of a moving average of the squares of that component.
This should be “near” |\mathrm{d}w_{i,j}|
So one would expect both a large and a small component \mathrm{d}w_{i,j} of the gradient to be “near 1” in magnitude (but not sign) after the division by \sqrt{s_{\mathrm{d}W^{l}}}.
I will try to build a plot.
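In the meantime, here is a quick numerical check of that intuition (a single component with small relative fluctuations; all values made up):

```python
import numpy as np

# If a component dw_ij only fluctuates a little around its typical size,
# sqrt(s) tracks |dw_ij| and the rescaled component ends up near +/-1,
# regardless of how big or small the component is.
beta, eps = 0.9, 1e-8
rng = np.random.default_rng(1)

for scale in [1e-4, 1.0, 1e3]:                    # tiny, medium, huge component
    s = 0.0
    for t in range(2000):
        dw = scale * (1.0 + 0.1 * rng.normal())   # small relative fluctuations
        s = beta * s + (1 - beta) * dw ** 2
    print(scale, dw / (np.sqrt(s) + eps))          # close to 1 in each case
```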
This ratio is 1 if the two betas are both equal to 0, which is not a meaningful case.
If dW is very large such that r_1 and r_2 are both approximately zero, then, from (3), we have …, ignoring C_t and \epsilon.
If dW is very close to 0, then, from (1), it’s still not trivial that the ratio becomes 1.
What do you think about writing a program, assuming some dW trajectories, such as an exponentially decaying trajectory, and seeing how the ratio evolves as dW tends to zero? I mean, in a normal situation, the gradient (both before and after applying Adam) should converge to zero as it reaches an optimal point.
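A minimal version of that experiment could look like the sketch below (scalar gradient, arbitrary decay rate, the usual default betas):

```python
import numpy as np

# Feed Adam an exponentially decaying scalar gradient and watch the
# rescaled component v_corr / (sqrt(s_corr) + eps) as dW tends to zero.
beta1, beta2, eps = 0.9, 0.999, 1e-8
v, s = 0.0, 0.0

for t in range(1, 5001):
    dW = 0.99 ** t                        # exponentially decaying gradient
    v = beta1 * v + (1 - beta1) * dW
    s = beta2 * s + (1 - beta2) * dW ** 2
    v_corr = v / (1 - beta1 ** t)
    s_corr = s / (1 - beta2 ** t)
    if t % 1000 == 0:
        print(t, dW, v_corr / (np.sqrt(s_corr) + eps))
```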
Cheers,
Raymond
I’m getting lost in my code currently, so first, here are the latest versions of the dataflow diagrams, with text that may or may not be 100% correct (I’m being pretty unapologetic about that; those algorithms are happy-go-lucky and the vocabulary is all over the place … luckily they work, if the weight “landscape” is successfully accommodating, that is).
Btw. for anyone interested in meta-heuristic algorithms (yet another one of those words, it means “rule of thumb to find a rule-of-thumb”), there is this:
Here we go:
Vanilla Gradient Descent
Gradient Descent “with momentum”
Gradient Descent “with root mean square propagation” (rmsprop)
Gradient Descent “with adaptive moments” (adam)
My own modified Adam
It feels more logical, but I have no idea about the performance (probably not great for reasons)
And also
It might be interesting to see what happens if the exponentially decaying moving average is replaced by a proper (uniformly weighted) average. This would be too expensive for billions of parameters, but we can always test it in our toy exercises.
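A toy version of that comparison might look like this (made-up, shrinking gradient stream; a uniformly weighted mean can be kept incrementally with just a counter, so at least in this scalar case the interesting difference is that it never forgets old gradients):

```python
import numpy as np

beta = 0.999
ema, mean_sq, count = 0.0, 0.0, 0
rng = np.random.default_rng(2)

for t in range(1, 10001):
    g = rng.normal(scale=1.0 / t ** 0.5)       # made-up, shrinking gradients
    ema = beta * ema + (1 - beta) * g ** 2     # exponentially decaying average
    count += 1
    mean_sq += (g ** 2 - mean_sq) / count      # exact uniformly weighted mean
    if t % 2500 == 0:
        print(t, ema / (1 - beta ** t), mean_sq)
```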