Probably a stupid question. On this slide you can see that the slope shows the rate of change of J as W changes. But at the bottom of the slide, the update is applied to W, not to J:
W := W - alpha * rate of change J
And therefore it seems that in this update we should not multiply alpha by the rate of change of J (i.e., by the slope) but divide by it, because the rate of change of W is inversely proportional to the rate of change of J. For example, if we took the derivative at a point on a steeper (more vertical) section of the graph, the derivative would be larger, because there a tiny change in W produces a big change in J.
Can you please explain why this is multiplication and not division?
The gradients are based on the derivative of the cost J with respect to each weight "w".
The notation dJ/dw doesn't mean we're performing a division. That's just the notation that calculus uses to indicate what the terms of the derivative are.
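As a quick illustration (my own toy example, not from the course materials), dJ/dw is just "the derivative of J with respect to w" and can be approximated numerically:

```python
# Toy cost function (hypothetical example): J(w) = (w - 3)^2, minimum at w = 3.
def J(w):
    return (w - 3) ** 2

def dJ_dw(w, eps=1e-6):
    # Central finite-difference approximation of the derivative dJ/dw.
    # This is what the notation means: how J changes per tiny change in w.
    return (J(w + eps) - J(w - eps)) / (2 * eps)

# The analytic derivative is 2 * (w - 3), so at w = 5 the slope is about 4.
print(dJ_dw(5.0))
```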
Thanks for the quick response. A small clarification, though: I don't mean the notation dJ/dw, but the multiplication between the learning rate (alpha on the slide) and dJ/dw, i.e., why we write
W := W - alpha * rate of change J
instead of
W := W - alpha / rate of change J
Although, now I think that if the error is large (a large value of the derivative), then we need to "get out" of that region of the graph faster, and so it would be right to change the parameter W more strongly. In that case multiplying really would be correct. Maybe this is the answer to my question…
Yes, I think your intuition in the last paragraph is right: if a small change in W results in a large change in J, then W is very far from where it should be and thus needs a larger change, not a smaller one.
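You can see that intuition directly in a toy case (my own example, with J(w) = w**2): the slope grows with distance from the minimum, so multiplying by it automatically gives bigger steps to points that are further off.

```python
alpha = 0.1  # learning rate (arbitrary value for illustration)

def step_size(w):
    # For J(w) = w**2 the slope dJ/dw is 2 * w, so the step taken by
    # w := w - alpha * dJ/dw has magnitude alpha * 2 * w.
    return alpha * 2 * w

# A point far from the minimum at w = 0 gets a proportionally bigger step
# than a point close to it.
print(step_size(0.5))  # small slope, small step
print(step_size(5.0))  # large slope, step 10x bigger
```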
All this is moderated by the learning rate. The problem is that the derivative is tangent to the curve, of course, which means that if you go too far in that direction, you're actually off the curve (well, it's really a surface in high dimensions). Gradient Descent is an iterative approximation method, and taking too large a step at any one iteration may cause divergence, so we need to find a learning rate that gives good behavior as we move towards an optimal solution one step at a time.

The other high-level point to make here is that Prof Ng is just showing us the simplest version of Gradient Descent for pedagogical reasons: this is our first exposure to the ideas, and we will be writing the code ourselves in numpy. Later we will graduate to using packaged solutions from TensorFlow which implement more sophisticated versions of the algorithm that use adaptive techniques for managing the learning rate.
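The divergence point can be sketched in a few lines (again a toy example of mine, with J(w) = w**2, not course code): a modest learning rate walks toward the minimum, while one that is too large overshoots further on every step.

```python
def descend(alpha, w=1.0, steps=20):
    # Plain gradient descent on J(w) = w**2, whose derivative dJ/dw = 2 * w.
    for _ in range(steps):
        w = w - alpha * 2 * w  # update: w := w - alpha * dJ/dw
    return w

# With alpha = 0.1 each step shrinks w by a factor of 0.8: convergence.
print(descend(0.1))
# With alpha = 1.1 each step multiplies w by -1.2: the magnitude blows up.
print(descend(1.1))
```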