In the Lab, Alpha is lower than discussed in video


  1. In Step 9 of Lab C1_W2_Lab03_Feature_Scaling_and_Learning_Rate_Soln, alpha is set very low, to 𝛼 = 9.9e-7.
    Yet the result shows the learning rate is too high: the solution does not converge, and cost increases rather than decreases.

  2. Step 12 lowers it to 𝛼 = 9e-7. The result: cost decreases throughout the run, showing that alpha is not too large.

  3. Step 14: 𝛼 = 1e-7, and cost is decreasing.


  1. In Step 12, why does the cost vs. w[0] chart oscillate between the two arms of the parabola? What is the interpretation?

  2. A very slight reduction in alpha has a lowering effect on the cost, but does it actually reach the minimum, or does it just need more iterations?

  3. I am a bit confused by what we learnt from the videos, where Andrew explained the rationale for trying alpha values from 0.001, then 0.003, then 0.01… And here the numbers are far smaller, yet the lab still says they are too high!
    Secondly, the starting value does not seem too high, yet the cost function spirals up. Can you please explain what is happening here?

Thank you,

Hello Venkat @Venkat_Subramani,

Consider the “step size” of w[0] to be proportional to the learning rate α. If w[0] is currently 5 units (an arbitrary unit) away from the minimum, and the sizes of the next few steps are:

  • [2, 1, 0.5, 0.2, 0.1, .... ], then we are walking (converging) towards the minimum without overshooting (like “Step 14”)

  • [7, 6, 5, 3, 2.5, ...], then we are still walking (converging) towards the minimum but with overshooting (like “Step 12”)

  • [10, 20, 40, 80, 100, ...], then we are actually walking away (diverging) from the minimum and with overshooting (like “Step 9”)

These exercises tell us the three possible cases when selecting a learning rate, and the symptoms of each case in terms of how the step sizes are changing over iterations of gradient descent. If the steps are decreasing, it is likely we are converging; however, if the steps are increasing, it is likely we are diverging.

We want step sizes to be smaller as we get close to the minimum.
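
The three behaviours above can be sketched with a toy example. This is a hypothetical 1-D cost J(w) = w² (not the lab's data or code); its update w ← w − α·dJ/dw multiplies w by (1 − 2α), so different α values reproduce all three cases:

```python
# Hypothetical 1-D sketch: cost J(w) = w**2, so dJ/dw = 2*w and the
# update w -= alpha * 2*w multiplies w by (1 - 2*alpha) each iteration.

def run_gd(alpha, w=5.0, iters=6):
    """Return |w| (distance from the minimum at w = 0) after each step."""
    history = []
    for _ in range(iters):
        w -= alpha * 2 * w        # gradient-descent update
        history.append(abs(w))
    return history

converge = run_gd(0.4)   # like "Step 14": steps shrink, no overshoot
oscillate = run_gd(0.9)  # like "Step 12": overshoots, but still converges
diverge = run_gd(1.1)    # like "Step 9": each step grows, cost increases

print(converge, oscillate, diverge, sep="\n")
```

Printing the three histories shows the distance to the minimum shrinking in the first two runs and growing in the third, mirroring the lab's cost curves.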

In the cases of “Step 14” and “Step 12”, since the step sizes get smaller and smaller over iterations, convergence is likely; so if we want to get closer to the minimum, we, as you said, need more iterations. However, we are unlikely to see it reach the minimum exactly, because, as noted above, the step sizes shrink as w[0] approaches the minimum, and they may become too small to land exactly on the minimum point. Usually, though, a w[0] close enough to the minimum is sufficient; we do not need the exact minimum itself.
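
On the “more iterations” point, a common practice (this is a sketch on the same hypothetical toy cost, not the lab's code) is to stop once the step size drops below a tolerance and accept a value that is merely close enough:

```python
# Hypothetical stopping rule on the toy cost J(w) = w**2 (minimum at w = 0):
# stop once the step just taken falls below a tolerance.
def gd_until_small_step(alpha=0.4, w=5.0, tol=1e-6, max_iters=1000):
    for i in range(max_iters):
        step = alpha * 2 * w      # dJ/dw = 2*w scaled by the learning rate
        w -= step
        if abs(step) < tol:       # steps shrink near the minimum
            return w, i + 1
    return w, max_iters

w_final, n_iters = gd_until_small_step()
print(w_final, n_iters)  # w_final is tiny but not exactly 0
```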

When the lab used 9.9e-7, 9e-7, and 1e-7, our features had not been scaled, so we needed a very small learning rate because some of our features span a large range (visit the lecture again for the concept of feature scaling). However, if you continue down the lab, it then scales the features, and gradient descent converges even at a learning rate of 1e-1! Let me quote from the lab, but I suggest you check out the lab again for the full text:

The scaled features get very accurate results much, much faster! Notice the gradient of each parameter is tiny by the end of this fairly short run. A learning rate of 0.1 is a good start for regression with normalized features.
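
To see the same effect outside the lab, here is a minimal, hypothetical sketch (made-up data, not the lab's dataset): z-score normalization lets the very same alpha = 0.1 converge where the raw, large-range feature diverges.

```python
import numpy as np

np.seterr(all="ignore")  # silence overflow warnings from the diverging run

# Hypothetical data: one feature spanning a large range (say, square feet).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2000.0, size=50)          # large-range feature
y = 0.1 * x + 5.0 + rng.normal(0.0, 1.0, 50)   # made-up linear target

def gd_final_cost(x, y, alpha, iters=200):
    """Gradient descent for w, b on the mean-squared-error cost; return the final cost."""
    w = b = 0.0
    m = len(x)
    for _ in range(iters):
        err = w * x + b - y
        w -= alpha * (err @ x) / m
        b -= alpha * err.sum() / m
    return float(((w * x + b - y) ** 2).mean() / 2)

x_scaled = (x - x.mean()) / x.std()            # z-score normalization

cost_raw = gd_final_cost(x, y, alpha=0.1)            # blows up (overflows)
cost_scaled = gd_final_cost(x_scaled, y, alpha=0.1)  # converges
print(cost_raw, cost_scaled)
```

The raw run overflows to a non-finite cost, while the scaled run settles near the noise floor, which is the point the lab's quote makes.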

Lastly, I believe Andrew’s numbers assume the features have been scaled, so we would not be considering 9.9e-7, 9e-7, or 1e-7. More importantly, the key idea behind Andrew’s numbers is how to jump from one trial value to the next. By jumping from 0.001 to 0.01 through 0.003, it takes only two trials to go from 0.001 to 0.01. In other words, if we want to search between 0.001 and 1, it only takes a few trials instead of many. In short, it saves us from trying too many learning rates.

9.9e-7 is not too high compared to 0.001 (Andrew’s numbers), but because, as I said, the dataset had not been scaled, the workable range of learning rates was pushed way down, and consequently 9.9e-7 was actually already too high.
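
To make the “few trials” idea concrete, here is a hypothetical helper (not from the course materials) generating Andrew’s roughly ×3 trial sequence; six jumps cover three orders of magnitude:

```python
# Hypothetical helper: alternate x3 and x10/3 so each pair of jumps moves
# one order of magnitude (0.001 -> 0.003 -> 0.01 -> ... -> 1).
def alpha_trials(start=0.001, stop=1.0):
    alphas, a, step = [], start, 3.0
    while a <= stop * 1.001:      # small tolerance for float rounding
        alphas.append(a)
        a *= step
        step = 10.0 / 3.0 if step == 3.0 else 3.0
    return alphas

print(alpha_trials())  # roughly [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
```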


Hi Raymond,
Thank you so much for explaining. You drilled the concept into me well.
I get it now.
Thanks once again.
Venkat S