Hello Venkat @Venkat_Subramani,
Consider the “step size” of w[0]
to be proportional to the learning rate \alpha. If w[0]
is currently 5 units (an arbitrary unit) away from the minimum, and the sizes of the next few steps are:

- [2, 1, 0.5, 0.2, 0.1, ...], then we are walking (converging) towards the minimum without overshooting (like “Step 14”)
- [7, 6, 5, 3, 2.5, ...], then we are still walking (converging) towards the minimum but with overshooting (like “Step 12”)
- [10, 20, 40, 80, 100, ...], then we are actually walking away (diverging) from the minimum, and with overshooting (like “Step 9”)
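These three behaviors are easy to reproduce on a toy problem. Below is a minimal sketch (my own example, not from the lab) of gradient descent on the 1-D quadratic J(w) = w², whose minimum is at w = 0, with three different learning rates:

```python
# Gradient descent on J(w) = w**2, whose gradient is 2*w and minimum is w = 0.
# We start 5 units away from the minimum, matching the example above.
def step_sizes(alpha, w0=5.0, n_steps=6):
    """Return the absolute step sizes |alpha * dJ/dw| over n_steps iterations."""
    w, sizes = w0, []
    for _ in range(n_steps):
        step = alpha * 2 * w          # gradient of w**2 is 2*w
        sizes.append(abs(step))
        w -= step
    return sizes

print(step_sizes(alpha=0.1))   # shrinking steps: converging, no overshoot
print(step_sizes(alpha=0.9))   # shrinking steps but w flips sign: overshooting
print(step_sizes(alpha=1.5))   # growing steps: diverging
```

With alpha=0.9, printing `w` inside the loop would show it alternating between positive and negative, which is exactly the overshooting picture of “Step 12”.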
These exercises show us the three possible cases when selecting a learning rate, and the symptom of each case in terms of how the step sizes change over the iterations of gradient descent. If the steps are decreasing, we are likely converging; if the steps are increasing, we are likely diverging.
We want step sizes to be smaller as we get close to the minimum.
In the cases of “Step 14” and “Step 12”, since the step sizes get smaller and smaller over the iterations, convergence is likely, so if we want to get closer to the minimum, we, as you said, need more iterations. However, we are unlikely to see it reach the minimum exactly, because, again, the step sizes shrink as it approaches the minimum, and they may become too small for it ever to land exactly on the minimum point. Usually, though, a w[0]
close enough to the minimum is sufficient, instead of the minimum itself.
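One common way to express this “close enough” idea in code is to stop once the update becomes tiny. A minimal sketch (my own toy example; the tolerance value is an assumption, not from the lab):

```python
# Gradient descent on J(w) = (w - 3)**2, minimum at w = 3. We stop when the
# step is smaller than a tolerance: close enough is good enough.
def gradient_descent(alpha=0.1, w=10.0, tol=1e-6, max_iters=10000):
    for i in range(max_iters):
        step = alpha * 2 * (w - 3)    # gradient of (w - 3)**2
        if abs(step) < tol:           # step too small to matter: stop here
            return w, i
        w -= step
    return w, max_iters

w_final, iters = gradient_descent()
print(w_final, iters)  # w_final is very near 3, but not exactly 3
```

Note that `w_final` never equals 3 exactly; the shrinking steps simply get it within the tolerance.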
When the lab used 9.9e-7, 9e-7, and 1e-7, our features had not been scaled, so we needed a very small learning rate because some of our features span a large range (revisit the lecture for the concept of feature scaling). However, if you continued down the lab, it then scaled the features, and gradient descent converged even at a learning rate of 1e-1! Let me quote from the lab, but I suggest you check out the lab again for the full text:
> The scaled features get very accurate results much, much faster! Notice the gradient of each parameter is tiny by the end of this fairly short run. A learning rate of 0.1 is a good start for regression with normalized features.
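For reference, here is a minimal sketch of z-score normalization, the kind of feature scaling the lab applies (the function name and sample numbers below are my own, not copied from the lab):

```python
import numpy as np

# Z-score normalization: each feature ends up with mean 0 and standard
# deviation 1, so a single learning rate works well for all features.
def zscore_normalize(X):
    mu = np.mean(X, axis=0)       # per-feature mean
    sigma = np.std(X, axis=0)     # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma

# Two features spanning very different ranges, e.g. house size vs. bedrooms.
X = np.array([[2104, 5], [1416, 3], [852, 2]], dtype=float)
X_norm, mu, sigma = zscore_normalize(X)
print(X_norm.mean(axis=0))  # approximately [0, 0]
print(X_norm.std(axis=0))   # approximately [1, 1]
```

After scaling, no single feature dominates the gradient, which is why a learning rate like 0.1 becomes usable.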
Lastly, I believe Andrew’s numbers assumed we had scaled the features, so we would not be considering 9.9e-7, 9e-7, or 1e-7. More importantly, the key idea behind Andrew’s numbers is how to jump from one value to the next when trying. By jumping from 0.001 to 0.01 through 0.003, it takes only two trials to go from 0.001 to 0.01. In other words, if we want to search between 0.001 and 1, it will only take a few trials instead of many. In short, it saves us from trying too many learning rates.
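The jumping pattern Andrew describes can be sketched like this (my own code, not from the course):

```python
# Andrew's "multiply by roughly 3" search: alternating x3 and x(10/3)
# from 0.001 covers three orders of magnitude in only 7 candidate values,
# versus about 1000 if we stepped by 0.001 each time.
candidates = sorted(round(m * 10.0**e, 4) for e in range(-3, 1) for m in (1, 3))
candidates = [c for c in candidates if c <= 1.0]   # keep the 0.001 .. 1 range
print(candidates)       # [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
print(len(candidates))  # 7 trials in total
```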
9.9e-7 is not high compared to 0.001 (Andrew’s numbers), but because, as I said, the dataset had not been scaled, the workable range of learning rates was pushed way down, and consequently 9.9e-7 was actually already too high.
Cheers,
Raymond