In this graph, we want to find the value of w for which J(w) is at its minimum.
Also, d(J(w))/d(w) can be stated as the rate of change of the function J(w) w.r.t. w (if I am not wrong),
and the equation to update w is
w = w - alpha * d(J(w))/d(w)
What I cannot understand is why we are using d(J(w))/d(w) to update w.
d(J(w))/d(w) tells us how the function J(w) changes as w changes. So how can we use that term to update 'w'?
I think @TMosh has made the point. Leaving aside the magnitude of the step, which is scaled by the learning rate, I think we are relying on the slope for its sign, which tells us whether to increase or decrease the weight to move it closer to a cost minimum.
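To make the sign argument concrete, here is a minimal sketch. It assumes a made-up cost J(w) = (w - 3)^2 with its minimum at w = 3; the names `J`, `dJ_dw`, and `gradient_descent`, and the values of `alpha` and `steps`, are chosen only for illustration and do not come from the course.

```python
def J(w):
    # Hypothetical cost with its minimum at w = 3 (an assumption for illustration).
    return (w - 3.0) ** 2


def dJ_dw(w):
    # Analytic derivative (slope) of the hypothetical cost above.
    return 2.0 * (w - 3.0)


def gradient_descent(w, alpha=0.1, steps=50):
    # Repeatedly apply the update rule w = w - alpha * dJ/dw.
    for _ in range(steps):
        w = w - alpha * dJ_dw(w)
    return w


# Left of the minimum the slope is negative, so subtracting it increases w.
# Right of the minimum the slope is positive, so subtracting it decreases w.
print("slope at w=0:", dJ_dw(0.0), "slope at w=6:", dJ_dw(6.0))

w_left = gradient_descent(0.0)
w_right = gradient_descent(6.0)
print("final w from the left:", w_left, "final w from the right:", w_right)
```

Both starting points should end up close to w = 3, because in each case the sign of the slope pushes w toward the minimum.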
If we are worried about w and dJ/dw being in different units, let's not forget that we still have the learning rate ("units of w per unit of slope") to make the units come out right.
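As a rough sketch of that unit argument, write [x] for "the units of x" (a notation assumed here, not used elsewhere in the thread). The slope has units of cost per unit of w, so if the learning rate carries units of w per unit of slope, the product is back in units of w:

```latex
\[
\left[\frac{dJ}{dw}\right] = \frac{[J]}{[w]},
\qquad
[\alpha] = \frac{[w]}{[J]/[w]} = \frac{[w]^2}{[J]},
\qquad
[\alpha]\cdot\left[\frac{dJ}{dw}\right]
  = \frac{[w]^2}{[J]}\cdot\frac{[J]}{[w]} = [w].
\]
```

So the term subtracted in w = w - alpha * d(J(w))/d(w) has the same units as w itself.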
@TMosh Thanks for helping me understand the equation from a different angle. Instead of thinking about how the cost function changes w.r.t. w, if I think about which direction to move w in to get to a local minimum, the parameter update equation makes complete sense.
@rmwkwok You actually figured out what my real contention was, which was the units. I didn't think about the learning rate (I assumed it was just a constant). Thank you.