I am going through the ‘Optional Lab: Feature scaling and learning rate’ and want to make sure I understand what effect feature scaling has on the model.
In simple terms, does scaling the features allow us to use a greater learning rate, alpha? In the lab example, the house size is in the thousands, and that limits the learning rate we can use: the partial derivative of the cost with respect to that feature’s weight comes out very large, so too big an alpha would make J diverge.
If we have scaled all of our features, then we can use a greater learning rate, which means we converge faster, with much less risk of a runaway cost J (divergence).
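To sanity-check my reading, here is a rough sketch I put together (with made-up toy housing numbers, not the lab’s data) of plain batch gradient descent on raw vs z-score-scaled features. With the size column in the thousands, an alpha of 1e-2 already makes the weights blow up, while after scaling an even larger alpha converges:

```python
import numpy as np

# Toy made-up data: size in sqft and number of bedrooms (not the lab's data)
X = np.array([[2104, 5], [1416, 3], [1534, 3], [852, 2]], dtype=float)
y = np.array([460.0, 232.0, 315.0, 178.0])

def gradient_descent(X, y, alpha, iters):
    """Plain batch gradient descent for linear regression; returns (w, b)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        err = X @ w + b - y              # prediction errors, shape (m,)
        w -= alpha * (X.T @ err) / m     # dJ/dw
        b -= alpha * err.mean()          # dJ/db
    return w, b

# Raw features: the size column (~1000s) makes dJ/dw huge, so alpha = 1e-2
# overshoots and J diverges -- the weights blow up. (Roughly, alpha would
# need to be around 1e-7 to stay stable on these raw numbers.)
w_raw, _ = gradient_descent(X, y, alpha=1e-2, iters=10)
print("raw features, alpha=1e-2:", w_raw)

# Z-score normalize each feature; now a much larger alpha converges fine.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
w_nrm, b_nrm = gradient_descent(X_norm, y, alpha=1e-1, iters=5000)
print("scaled features, alpha=1e-1:", w_nrm, b_nrm)
```

So at least on this toy example, it is the scaling that makes the larger alpha usable at all.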
Feature scaling is mainly used so that the weights are updated evenly during gradient descent, which means fewer iterations are needed to converge to the minimum.
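Here is a quick way to see the “uneven updates” part numerically (same toy numbers as above, nothing from the lab): with raw features, the partial derivative for the size weight is hundreds of times larger than the one for the bedrooms weight, so a single alpha cannot suit both weights; after z-score scaling the two gradients end up on a comparable scale:

```python
import numpy as np

# Same kind of toy data as above: size (sqft) and bedrooms (made up)
X = np.array([[2104, 5], [1416, 3], [1534, 3], [852, 2]], dtype=float)
y = np.array([460.0, 232.0, 315.0, 178.0])
m = len(y)

err = -y                                         # errors at w = 0, b = 0
print("raw gradients:   ", X.T @ err / m)        # ~[-4.8e5, -1.1e3]: wildly different scales

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print("scaled gradients:", X_norm.T @ err / m)   # both around -1e2: comparable scales
```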
I’m not certain we can say, as a rule, that because we have scaled the features we can increase the learning rate alpha.
From the Details section of the lab we see what happens when the features are not scaled. The weight for one feature converges quickly to its final value while the weights for the other features take many more iterations: