Probably a stupid question. On this slide you can see that the slope shows the rate of change of J as W changes. But at the bottom of the slide, the update is applied to W, not to J:
W := W - alpha * rate of change J
And therefore it seems that in this update we should not multiply alpha by the rate of change of J (i.e., by the slope) but divide by it, because the rate of change of W is inversely proportional to the rate of change of J. For example, if we took the derivative at a point on a steeper (more vertical) section of the graph, the derivative would be larger, because there a tiny change in W produces a big change in J.
Can you please explain why this is multiplication and not division?
The gradients are based on the derivative of the cost J with respect to each weight "w".
The notation dJ/dw doesn't mean we're performing a division. That's just the notation that calculus uses to indicate what the terms of the derivative are.
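As a quick illustration (my own toy example, not from the course materials), dJ/dw is just "the derivative of J with respect to w" and can be approximated numerically:

```python
# Toy cost function (hypothetical example): J(w) = (w - 3)^2, minimum at w = 3.
def J(w):
    return (w - 3) ** 2

def dJ_dw(w, eps=1e-6):
    # Central finite-difference approximation of the derivative dJ/dw.
    # This is what the notation means: how J changes per tiny change in w.
    return (J(w + eps) - J(w - eps)) / (2 * eps)

# The analytic derivative is 2 * (w - 3), so at w = 5 the slope is about 4.
print(dJ_dw(5.0))
```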
Thanks for the quick response. A small clarification, though: I don't mean the notation dJ/dw, but the multiplication between the learning rate (alpha on the slide) and dJ/dw, i.e., why we write
W := W - alpha * rate of change J
instead of
W := W - alpha / rate of change J
Although, now I think that if the error is large (a large value of the derivative), then we need to "get out" of that region of the graph faster, and so it would be right to change the parameter W more strongly. In that case multiplying really would be correct. Maybe this is the answer to my question…
Yes, I think your intuition in the last paragraph is right: if a small change in W results in a large change in J, then W is very far from where it should be and thus needs a larger change, not a smaller one.
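You can see that intuition directly in a toy case (my own example, with J(w) = w**2): the slope grows with distance from the minimum, so multiplying by it automatically gives bigger steps to points that are further off.

```python
alpha = 0.1  # learning rate (arbitrary value for illustration)

def step_size(w):
    # For J(w) = w**2 the slope dJ/dw is 2 * w, so the step taken by
    # w := w - alpha * dJ/dw has magnitude alpha * 2 * w.
    return alpha * 2 * w

# A point far from the minimum at w = 0 gets a proportionally bigger step
# than a point close to it.
print(step_size(0.5))  # small slope, small step
print(step_size(5.0))  # large slope, step 10x bigger
```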
All this is moderated by the learning rate. The problem is that the derivative is tangent to the curve, of course, which means that if you go too far in that direction, you're actually off the curve (well, it's really a surface in high dimensions). Gradient Descent is an iterative approximation method, and taking too large a step at any one iteration may cause divergence, so we need to find a learning rate that gives good behavior as we move towards an optimal solution one step at a time.

The other high-level point to make here is that Prof Ng is just showing us the simplest version of Gradient Descent for pedagogical reasons: this is our first exposure to the ideas, and we will be writing the code ourselves in numpy. Later we will graduate to using packaged solutions from TensorFlow which implement more sophisticated versions of the algorithm that use adaptive techniques for managing the learning rate.
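The divergence point can be sketched in a few lines (again a toy example of mine, with J(w) = w**2, not course code): a modest learning rate walks toward the minimum, while one that is too large overshoots further on every step.

```python
def descend(alpha, w=1.0, steps=20):
    # Plain gradient descent on J(w) = w**2, whose derivative dJ/dw = 2 * w.
    for _ in range(steps):
        w = w - alpha * 2 * w  # update: w := w - alpha * dJ/dw
    return w

# With alpha = 0.1 each step shrinks w by a factor of 0.8: convergence.
print(descend(0.1))
# With alpha = 1.1 each step multiplies w by -1.2: the magnitude blows up.
print(descend(1.1))
```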