Hi community,
In the lecture, Andrew says:
If you average out these gradients, you find that the oscillations in the vertical direction will tend to average out to something closer to zero. So, in the vertical direction, where you want to slow things down, this will average out positive and negative numbers, so the average will be close to zero.
Question 1) What is the vertical direction? And what are these positive and negative numbers that tend to average out?
Whereas, on the horizontal direction, all the derivatives are pointing to the right of the horizontal direction, so the average in the horizontal direction will still be pretty big.
Question 2) What is the horizontal direction? Which derivatives are pointing to the right along the horizontal direction, and why will their average still be pretty big?
The diagram shows the “contour lines” of the cost surface projected down onto a 2D plane. The point here is that the surface is not symmetric: if you visualize it in 3 dimensions, one axis is “squashed” relative to the other.
Of course this is ridiculously simplified compared to what happens in real networks. The fundamental issue is that the number of dimensions here is the number of parameters, and even the relatively small sample networks we deal with in this course have hundreds or thousands of parameters. Unfortunately it’s impossible to draw diagrams in 100-dimensional space. And “real” models typically have millions or even billions of parameters; GPT-4 reportedly has about 1.76 trillion. Good luck trying to visualize the solution surface for that cost function. There’s a paper from Yann LeCun’s group about visualizing solution surfaces that’s worth a look.
On the point about vertical versus horizontal movement, notice that the gradient arrows point in the wrong direction (more vertical than horizontal) precisely because of the “squashed” shape of the contours. The gradient is perpendicular to the tangent of the contour line at any point: along the contour line the cost doesn’t change at all, so the direction of maximal change (either increase or decrease) is perpendicular to it. Of course, another possible way to fix that asymmetry is to normalize the inputs.
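To make the “averaging out” concrete, here is a small sketch on a hypothetical squashed bowl (the cost function, starting point, and learning rate are made-up illustrations, not values from the lecture). With a largish learning rate, plain gradient descent overshoots in the steep “vertical” (w2) direction, so those gradient components alternate in sign and nearly cancel in an average, while the shallow “horizontal” (w1) components all share the same sign:

```python
import numpy as np

# Hypothetical squashed bowl (not the lecture's actual cost function):
# f(w1, w2) = w1**2 + 25 * w2**2, so grad f = (2*w1, 50*w2).
# Its contours are elongated ellipses: steep "vertically" (w2),
# shallow "horizontally" (w1).
def grad(w):
    return np.array([2.0 * w[0], 50.0 * w[1]])

w = np.array([-10.0, 1.0])   # start far to the left, a bit above the axis
lr = 0.035                   # large enough that plain GD overshoots in w2
beta = 0.9
v = np.zeros(2)              # momentum's exponentially weighted average

grads = []
for _ in range(20):
    g = grad(w)
    grads.append(g)
    v = beta * v + (1 - beta) * g   # the running average momentum maintains
    w = w - lr * g                  # plain gradient descent step

grads = np.array(grads)
print("mean vertical gradient:  ", grads[:, 1].mean())  # small: +50, -37.5, ... cancel
print("mean horizontal gradient:", grads[:, 0].mean())  # large, all the same sign
print("momentum average v:      ", v)                   # points mostly horizontally
```

The momentum average `v` ends up much larger in the horizontal component than in the vertical one, which is exactly why a momentum step damps the oscillation while still making good progress toward the minimum. Normalizing the inputs, as mentioned above, would instead make the contours rounder so the asymmetry never arises in the first place.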