Can someone help me understand why the size of the increase/decrease of w gets larger and larger with every step in this example? (the pink line)
Hello @yhuai_lin
Welcome to the community
This is a situation that happens when we choose a high value for the learning rate \alpha
The aim of gradient descent is to keep updating the parameters w and b until we reach the minimum of the cost. When we set a high value for \alpha, instead of w taking small steps and moving closer to the minimum, it takes bigger steps, thereby overshooting the minimum.
With reference to the figure shown above, the initial value of w is the lowest point to the left of the minimum. The next update should have brought w a step closer to the minimum. However, due to the high value of \alpha, the update makes w overshoot the minimum and end up on its right. The next update of w again aims to bring it closer to the minimum, but in doing so gives it an even larger update and pushes it further away, now to the left of the minimum. As w keeps getting pushed farther and farther away from the minimum, the corresponding cost J(w) increases as well. This is why you see the arrows climbing up the cost curve.
With every subsequent update of w, the following behaviour can be noted (the sketch below illustrates both points):
- w keeps bouncing around from one side of the minimum to the other
- the magnitude of the update to w increases with each step.
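To see both points numerically, here is a minimal Python sketch. It is my own toy example, not the cost from the figure: it assumes a hypothetical cost J(w) = w^2 (minimum at w = 0) and a deliberately large learning rate.

```python
def dJ_dw(w):
    return 2 * w          # derivative of the toy cost J(w) = w**2

alpha = 1.1               # deliberately too large
w = -1.0                  # start to the left of the minimum at w = 0

for step in range(6):
    update = alpha * dJ_dw(w)   # alpha * dJ/dw
    w = w - update              # gradient descent step
    print(f"step {step + 1}: w = {w:+.3f}, |update| = {abs(update):.3f}, J(w) = {w * w:.3f}")
```

At every step w flips to the other side of the minimum, and both the update size and J(w) keep growing.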
The update value of w for each step of gradient descent depends on:
- the learning rate \alpha
- the derivative \dfrac{dJ}{dw}
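Putting the two together, the standard gradient descent update rule is

w := w - \alpha \dfrac{dJ}{dw}

so a larger \alpha or a larger magnitude of \dfrac{dJ}{dw} both produce a larger change in w.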
The magnitude of \dfrac{dJ}{dw} is small for points on the cost curve near the minimum and keeps increasing as we move away from it. In the example above, each update of w pushes it further away from the minimum, so we move higher up on the cost curve and the magnitude of \dfrac{dJ}{dw} at the new value of w is larger than it was at the previous step. Consequently, the amount by which w gets updated in the next step is larger than in the previous step.
We can prevent this unbounded growth of w and of the cost J(w) by setting an appropriately low value for the learning rate \alpha. Prof. Andrew covers this in detail in Week 2 of Course 1.
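For contrast, here is the same toy cost J(w) = w^2 with a small learning rate; this is only a sketch under the same assumptions as the earlier snippet.

```python
def dJ_dw(w):
    return 2 * w          # derivative of the toy cost J(w) = w**2

alpha = 0.1               # appropriately small learning rate
w = -1.0                  # same starting point as before

for step in range(30):
    w = w - alpha * dJ_dw(w)    # each update is smaller than the last

print(f"w after 30 steps: {w:.6f}")   # roughly -0.001238, i.e. almost at the minimum
```

Here the updates shrink at every step, and w settles at the minimum instead of bouncing away from it.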
The interesting thing is that we don't know by how much we need to update w to get closer to the minimum; we only know the direction of the next update.
Here, we not only overshot the minimum right in the first step, but we landed farther away from it on the right than we had been on the left before the step. Just imagine a green marker on the horizontal axis, similar to the blue ones, directly below the minimum point. The distance between this imaginary green marker and the right marker (where we jumped to) is greater than the distance between the left blue marker (our original position) and the green marker.
Unfortunately, this means a larger gradient of the cost function in absolute value, and thus a larger next jump. And the story goes on like this: the size of each increase/decrease gets larger and larger, because the absolute value of the gradient keeps growing.
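To put a number on this, take the same toy cost J(w) = w^2 used in the sketches above (an assumption for illustration, not the cost from the figure). Since \dfrac{dJ}{dw} = 2w, one update gives

w_{new} = w - \alpha \cdot 2w = (1 - 2\alpha)\, w

so the distance from the minimum at w = 0 is multiplied by |1 - 2\alpha| at every step. Whenever \alpha > 1, that factor is greater than 1, and the jumps grow geometrically, which is exactly the behaviour of the pink line.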
Thanks a lot for the explanation!
You are most welcome @yhuai_lin